Abstract

In this thesis we explore means by which hypercubes can compute despite faulty proces- 
sors and links. We also study techniques which enable hypercubes to simulate dynamically 
changing networks and data structures. 

In chapter two, we investigate strategies for routing permutations on faulty hypercubes. 
We assume that each node or edge in the hypercube fails with fixed probability $p < 1 - \sqrt{1/2}$
and that failures are independent of one another. We describe a constant $c > 0$ and a routing
algorithm which successfully routes messages between working processors in $O(\log N)$ steps
on an $N$-node faulty hypercube, with probability $1 - N^{-c}$. We also strengthen an algorithm
due to Rabin which uses a redundant encoding of each message into $\log N$ pieces which are
routed along node-disjoint paths. A destination can reconstruct the original message as long
as at least $(\log N)/2$ pieces arrive intact. We show that all messages are reconstructable at
their destinations with high probability, given that each node or edge fails with probability
$O(1/\log N)$ and that each message has $\Omega(\log^2 N)$ bits. This guarantee obtains even if the
components fail during the course of the algorithm. 

In chapter three, we develop techniques for reconfiguring hypercubes in the presence of 
faults. Again assuming constant probabilities of failure and the independence of faults, we 
show that a faulty hypercube can simulate a fault-free hypercube of the same size with only 
constant delay. We exhibit both deterministic and randomized algorithms for hypercube 
reconfiguration. We show that there exists a constant $c' > 0$ such that with probability
$1 - N^{-c'}$ the deterministic algorithm finds a one-to-one embedding with dilation 3 and
$O(\log N)$ congestion. We also show that there exists a constant $c'' > 0$ such that with
probability $1 - N^{-c''}$ the randomized algorithm finds an embedding with constant load and
congestion, and with dilation 5.

In chapter four, we turn our attention to the embedding of dynamically growing data 
structures in the hypercube. Specifically, we show that an arbitrarily growing binary tree 
with a maximum of $M$ nodes can be embedded in an $N$-node hypercube with load $O(M/N + 1)$,
congestion $O(M/N + 1)$ and dilation 12, with high probability. We also show how to embed a
dynamic $M$-node binary tree in an $N$-node butterfly with $O(M/N + \log N)$ load and dilation
2, with high probability.
Chapter 1



Introduction 



1.1 Hypercubes 

The hypercube has emerged as one of the most effective and popular network archi- 
tectures for large scale parallel computers. The Connection Machine, manufactured 
and sold by Thinking Machines Corp., is a hypercube-based machine containing $2^{16}$
processing elements. Machines based on hypercube architectures have been built by 
Intel, Ncube, Caltech and others. It has been predicted that in the not-too-distant 
future, hypercube-based machines containing up to a million processors will be avail- 
able. Thus, current conditions point to the utility of more advanced methods for 
hypercube computation. 

The $n$-dimensional hypercube $H_n$ is a graph with $N = 2^n$ nodes and $Nn/2$ edges.
The nodes of $H_n$ are labeled with $n$-bit binary strings, and two nodes are linked by
an edge if the associated strings differ in precisely one bit. If the differing bit is in the
$i$th position ($1 \le i \le n$) then the associated edge is called a dimension $i$ edge. The
neighbor of a node $v$ across the $i$th dimension will be denoted by $v^i$. Similarly $v^{i_1 i_2 \cdots i_k}$
will denote the node reached from $v$ by traversing dimensions $i_1, i_2, \ldots, i_k$ (that is,
by flipping those bits). We will use $n$ and $\log N$ interchangeably. Pictures of labeled
two- and three-dimensional hypercubes and an unlabeled four-dimensional hypercube
appear in figures 1-1 and 1-2.
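
As an illustration only, the labeling just described can be sketched in Python. The sketch assumes that node labels are stored as integers and that bit position i of the text corresponds to bit i-1 of the integer; the text does not fix which end of the string is position 1, so this mapping and the function names are our own choices.

```python
def neighbor(v: int, i: int, n: int) -> int:
    """Return v^i, the neighbor of node v across dimension i (1 <= i <= n)."""
    if not 1 <= i <= n:
        raise ValueError("dimension must lie between 1 and n")
    # Assumption: dimension i is stored in bit i-1 of the integer label.
    return v ^ (1 << (i - 1))


def traverse(v: int, dims, n: int) -> int:
    """Return the node reached from v by flipping the listed dimensions in order."""
    for i in dims:
        v = neighbor(v, i, n)
    return v


def hypercube_edges(n: int):
    """List each edge of H_n once as (u, w, dimension) with u < w."""
    edges = []
    for v in range(2 ** n):
        for i in range(1, n + 1):
            w = neighbor(v, i, n)
            if v < w:
                edges.append((v, w, i))
    return edges
```

Counting the list returned by `hypercube_edges(n)` gives a quick check of the $Nn/2$ edge count stated above.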

In hypercube-based machines, the nodes of the graph are replaced by processors
and the edges are replaced by links between the processors. For example, in the
Connection Machine each node of a 12-dimensional hypercube contains a group of 16
processors.

Figure 1-1: Labeled 2- and 3-dimensional hypercubes.

Figure 1-2: A 4-dimensional hypercube.

The effectiveness of the hypercube for parallel computation arises from the wealth 
of special-purpose algorithms written for it, its support of algorithms written for 
shared-memory machines and its ability to simulate a host of other networks. Many 
algorithms which run quickly on the hypercube already exist. Further, the hyper- 
cube's recursive structure and high connectivity make it likely that fast hypercube 
algorithms will continue to be invented in other contexts. 

Hypercubes have demonstrated their usefulness as general-purpose computers as 
well. Fast routing algorithms ([VB], [Ran], [P]) allow for low-overhead interprocessor 
communication. These algorithms enable the hypercube to simulate a parallel random 
access machine, or PRAM, with only logarithmic delay. Since any set of messages
is deliverable in $O(\log N)$ time, each set of memory accesses can be simulated in
$O(\log N)$ time as well, even if the PRAM's processors and memory locations are
spread arbitrarily among the hypercube's processors.

Hypercubes perform even more admirably when simulating special-purpose net- 
works. The hypercube can simulate meshes, multidimensional arrays, binary trees, 
x-trees, pyramid graphs, butterflies, cube-connected cycles and other networks, all 
with constant delay. In many cases, these other networks are actually subgraphs of 
the cube. In these instances, the hypercube can simulate the special-purpose network 
with no delay at all. 

1.2 Robustness 

In this thesis we will describe three ways in which the hypercube is robust in a chang- 
ing computational environment. Specifically, we show how the hypercube can support 
fault-tolerant routing, how the hypercube can be easily reconfigured in the presence
of faults and how the hypercube can handle dynamically changing load requirements. 
In the first two cases, the network itself changes due to the accumulation of faulty 
processors and links. We show how the network can absorb these faulty components
while exhibiting little or no degradation of performance. In the third case, the com- 
putation we expect the network to perform changes in accordance with the data in an 
unpredictable fashion. We show how the network can distribute the resulting com- 
putational load as optimally as if it had been completely specified beforehand. In all 
three cases, a probabilistic approach helps us to achieve our results. In some cases, 
we prove that these results would be impossible if randomness were not available. 

In chapters two and three, we explore fault-tolerant properties of the hypercube.
We assume that each node or edge has some constant probability of failure. In chapter 
two we exhibit two randomized algorithms for routing permutations on hypercubes 
in the presence of faulty components. Both algorithms are based on Valiant and 
Brebner's ([VB]) original randomized algorithm for routing permutations on hyper- 
cubes. In the first algorithm, we modify the fault-free algorithm so that messages 
avoid faults. In the second algorithm, packets are broken into pieces containing re- 
dundant information. Since only a constant fraction of the pieces need to get through 
to reconstruct the original packet, the algorithm can tolerate the loss of many pieces 
due to faults. To route a permutation, neither algorithm takes more than a constant 
factor more time than is required to route without faults. 

Chapter three is devoted to reconfiguration algorithms. The effect of these al- 
gorithms is that the nonfaulty processors of a hypercube with faults simulate the 
processors of a completely functioning hypercube. The link connecting two proces- 
sors in the completely functioning hypercube appears as a functioning path between 
the nodes simulating them in the cube with faults. In chapter three, we describe 
reconfiguration algorithms which enable a hypercube with many faults to compute as 
efficiently as a hypercube of the same size without faults. 

The efficient simulation of dynamically evolving computation structures is the 
subject of chapter four. We show that a hypercube can simulate an arbitrarily growing 
binary tree with only constant overhead. As the tree evolves, new nodes are assigned 
to hypercube processors. Neighbors in the tree are simulated by hypercube processors 
only a constant distance apart. For any tree, the randomized algorithm assigns only 
a constant number of tree nodes to each processor with a probability that can be
made arbitrarily close to 1. Thus both computation and communication overhead 
are minimized. 

In sections 1.3 - 1.5, we give an overview of the results in each of chapters two, 
three and four. 

1.3 Fault-Tolerant Routing

Given a network with a large number of components, we must assume that some of 
these components will fail. These faults may be introduced when the machine is first 
built, or might accumulate over time. We would like the machine to work despite the 
faults. 

Currently, when a processor or connection in the Connection Machine fails, the 
board containing the offending component is removed and replaced with a functional 
board. At some point in the future, if and when very large machines are in general
use, fault-tolerant algorithms may well provide a viable alternative to wholesale
replacement. Such algorithms might enable the machine to correct itself, with no 
outside intervention. 

Fault-tolerant behavior will be a major focus of our work. Routing in the presence 
of faults, which we study in chapter two, requires techniques for either stepping around 
faults or coping with messages which run into faults. Attempts have been made on 
both of these fronts. We consider a routing algorithm successful if every packet sent 
from a working processor to another working processor arrives intact. Of course, this 
view presupposes that the higher-level algorithm in effect is also tolerant of faulty 
processors. For example, a PRAM algorithm would have to tolerate some pattern of 
faults among the PRAM's processors. Such algorithms have yet to be designed. 

Throughout chapter two, we assume that there is some fixed probability p (either 
a constant or a function of the number N of nodes in the network) such that each 
component of the hypercube fails with probability p. Furthermore, we will assume 
that the failure of any given component is independent of the status of other parts 
of the network. In some cases, this independence assumption may be unreasonable.
Components which share a physical location such as a chip or a board might have a 
greater chance of failing in tandem. In this situation, our results can scale to work 
in a hierarchical fashion. We may regard any hypercube as a hypercube whose nodes 
are themselves hypercubes (a cross product of hypercubes). Thus we may treat the 
chips or boards as nodes in a more coarse-grained hypercube. 

Many of our algorithms are randomized as well. These algorithms have access 
to a source of randomness and we only guarantee that they achieve desired results 
an overwhelmingly large fraction of the time. Specifically, we guarantee that each 
algorithm succeeds with probability at least $1 - N^{-k}$; i.e., that each fails with a
probability that is at most an inverse polynomial in $N$. If we can make the exponent $k$ as
large as we like (perhaps by relaxing constants in the performance we desire), then 
we say that the algorithm succeeds with high probability. 

In [VB], Valiant and Brebner define a set of paths from sources to destinations 
which, with high probability, allow all packets to arrive at their destinations in 
$O(\log N)$ steps. Two different variations on Valiant and Brebner's ideas allow us
to route in the presence of faults. These variations use different assumptions about
the prevalence of faults, the capability of processors to detect faults in neighboring
components, and the minimum size of the packets that we can route. In the first case,
we assume that faults occur independently and with constant probability $p$, that each
processor can detect in one time unit whether or not an adjacent node or link has
failed, and that messages have length $\Omega(\log N)$. Our idea is for packets to follow close
to the paths defined in [VB], but loosely enough that they can avoid faults as they
encounter them. We show that if each packet avoids faults by taking random steps
away from its Valiant-Brebner path, then with high probability each packet uses a
path with only $O(\log N)$ edges and encounters only $O(\log N)$ other packets on its
path. This shows that each packet arrives at its destination in $O(\log N)$ steps.

We devote the second half of chapter two to our improvements of an idea of Rabin 
([R]). In this case, we assume that each edge of the hypercube fails with probability
$p = O(1/\log^2 N)$, that processors remain ignorant of changes in the topology of the
network, and that packets have size $\Omega(\log^2 N)$. Under these assumptions, Rabin
showed that if each packet is split into $\log N$ pieces and the pieces are routed to the
packet's original destination by node-disjoint paths, then a constant fraction of each
packet's pieces will arrive intact at the destination. This assumes that each piece
makes no attempt to avoid faults. A piece arrives at the destination if and only if
no faults lie on its path. Coupled with a method for recovering a packet from a
constant fraction of its pieces, this strategy allows us to choose paths as if the faults
were not there. We describe a very simple way to choose the paths: we use $\log N$
paths parallel to the Valiant-Brebner path. We are then able to simplify the proof,
to allow node failures as well, and to increase the allowable failure rates to include
probabilities as high as $p = O(1/n)$. (Recently, Giladi has reported similar results
([G]).)
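
To give a feeling for why a failure rate of $O(1/n)$ still lets a constant fraction of pieces through, the following Python sketch computes the chance that a single piece meets no faulty edge. It is a back-of-the-envelope illustration under our own assumptions, not the analysis of chapter two: the constant `c`, the path length `path_edges`, and the function name are hypothetical, and only independent edge faults are counted.

```python
import math


def piece_survival_probability(n: int, c: float, path_edges: int) -> float:
    """Probability that a path with `path_edges` edges contains no faulty edge,
    when each edge fails independently with probability p = c / n.

    Assumptions (ours, for illustration): n = log N, and node faults are ignored.
    """
    p = c / n
    return (1.0 - p) ** path_edges


# Hypothetical example: a path of 2n edges in a cube of dimension n = 1000.
survival = piece_survival_probability(n=1000, c=0.1, path_edges=2000)
limit = math.exp(-2 * 0.1)  # (1 - c/n)^(2n) tends to e^(-2c) as n grows
```

For a path of about $2n$ edges the survival probability stays near the constant $e^{-2c}$ however large $n$ becomes, which is why each piece arrives with constant probability when $p = O(1/n)$.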

1.4 Reconfiguration 

Network reconfiguration involves assigning to working components the tasks that 
the failed components would otherwise perform. The goal is to leave the network's 
processing power undiminished in the eyes of the outside world, except perhaps for 
a minor slowdown in speed to allow some components to perform multiple duty. 
Alternatively, we can view reconfiguration as the embedding of a fault-free network 
$H'_n$ of the same size into the working parts of the faulty network $H_n$. We can show that
even if a constant fraction of the hypercube's processors and links fail, what remains 
keeps the original cube's processing power with only a constant factor degradation in 
speed, with high probability. 

We make the same probabilistic fault assumptions in chapter three that we made 
in chapter two. Each component fails with constant probability and independently 
of other components. 

Some of our techniques may be of use with other hypercube-related problems. In
particular, there is one simple observation that is used in two forms in section 3.5.
Although the observation has probably been made by others, it is basic enough that
we think it worth highlighting as a paradigm for distributed match-making.

We will describe the result in its most basic form. Consider a collection of $\Theta(N)$
men and $\Theta(N)$ women at a dance. Assume that each man has $\Omega(X)$ female
friends and that each woman has $O(X)$ male friends. By Hall's marriage
theorem, it is possible to schedule $O(1)$ rounds of dances so that every man dances
with at least one friend and every woman dances at most $O(1)$ times. Unfortunately,
the problem of scheduling dance partners requires substantial global coordination.
For our purposes, we focus on a scenario where pairing is accomplished simply by
a man asking a woman to dance. If many men ask a woman to dance at once, she
accepts as many as she can, making sure not to exceed her capacity of $C = O(1)$
dances for the evening. If she can only accept some of the men, she prefers the
tallest among them. Each man chooses a friend randomly for each dance (without
knowledge of which women are tired or which women other men are asking) until he
dances. The result (which we call the Dance Hall Theorem, pun intended) is that if
$X = \Omega(\log N)$, and there are $\Omega(\log N)$ dances, then with high probability every man
will dance during the course of the evening. That is, for any lower bound $bX$ on the
number of female friends each man has, any upper bound $b'X$ on the number of male
friends each woman has and any constant $k$, there is a $C$ such that for sufficiently
large $N$, with probability $1 - N^{-k}$ a capacity of $C$ is sufficient.
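
The local pairing process described above can be simulated directly. The Python sketch below is our own illustration of the scenario and not the construction used in chapter three; the function name, the data layout (a dictionary of friend lists and a dictionary of heights) and the parameters are assumptions made for the example, and every man is assumed to have at least one friend.

```python
import random


def dance_hall(friends, capacity, rounds, height, rng):
    """Simulate the dance: friends[m] lists the women who are friends of man m.

    Returns a dict mapping each man who danced to his partner.
    """
    dances_left = {}  # woman -> dances she can still accept this evening
    partner = {}      # man -> woman he danced with
    for _ in range(rounds):
        # Each man who has not yet danced asks a uniformly random friend.
        requests = {}
        for m, fs in friends.items():
            if m not in partner:
                requests.setdefault(rng.choice(fs), []).append(m)
        # Each woman accepts as many askers as her remaining capacity allows,
        # preferring the tallest among them.
        for w, askers in requests.items():
            left = dances_left.get(w, capacity)
            askers.sort(key=lambda m: height[m], reverse=True)
            accepted = askers[:left]
            for m in accepted:
                partner[m] = w
            dances_left[w] = left - len(accepted)
    return partner
```

No global matchmaker appears in the sketch: each man consults only his own friend list, and each woman only her own remaining capacity, which is the point of the scenario.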

The Dance Hall Theorem scenario first arises in our analysis when we attempt to 
embed the nodes of $H'_n$ in the functioning nodes of $H_n$. The nodes of $H'_n$ correspond
to men and the functioning nodes of $H_n$ correspond to women. If a man dances with
a woman, then the corresponding node of $H'_n$ will be simulated by the corresponding
node of $H_n$. We need the Dance Hall Theorem to ensure that the load of the embedding
is $O(1)$ (i.e. every woman dances with $O(1)$ men) and to ensure that the
embedding can be constructed quickly with local control (no global matchmaker).
We also need some other as-yet-undescribed properties of the Dance Hall Theorem 
schedule to ensure that the hypercube's edges are not overtaxed by the embedding, 
but these are more technical in nature and will be dealt with in the main text. 
1.5 Dynamic Load Balancing

The desire for the optimal use of computational resources is often modelled as an 
embedding problem. We construct a graph whose nodes represent the data and pro- 
cesses. An edge connects two processes which trade information. To minimize com- 
putation time, we would like to divide the processing requirements evenly among the 
processors of our network. To minimize communication time, we would like to assign 
neighboring processes to processors which are fairly close. These two requirements 
may conflict. 

For one solution, we might build a network which perfectly mirrors the processes 
involved and embed each process in its own processor. There are two problems with 
this approach. First, every algorithm would require a different network structure 
depending upon how it divided up the work. Worse, the same algorithm might 
generate a different process graph for different input data. In this case no foresight 
could help in network construction. One (far from unique) example can be found in 
the context of branch-and-bound algorithms. The search tree developed during each 
run of a branch-and-bound algorithm changes based on which subtrees are cut and 
which are chosen for further exploration. We could not hope to build a processor tree 
which could handle all potentialities unless it were far larger than any one tree that 
might be generated during any particular run. 

As a second solution, we might build a network into which all similarly sized trees 
can be embedded. A practical network would allow us to embed a tree dynamically. 
As we embed the tree, we have no knowledge of which branches will develop many 
nodes in the future, and which will cease to exist at all. We must allow sufficient 
room for all possibilities. 

In chapter four, we demonstrate a randomized algorithm which, with high prob- 
ability, embeds an arbitrary dynamic binary tree in a hypercube so that the compu- 
tation and communication overhead are both constant. A simplified version of the 
algorithm embeds a dynamically growing tree in a butterfly smaller by a logarithmic 
factor. Both computation and communication are slowed by only a logarithmic factor.

Chapter 2



Routing in the Presence of Faults 



2.1 Introduction 

To successfully simulate shared memory, a parallel network must have the ability to 
route information between different origin processors and destination processors at 
the same time. Since processors trade information throughout the course of parallel 
computations, the overhead due to the transmission of information over the network 
shows up as a multiplicative factor in the time to perform many tasks. Thus the 
routing question is one of fundamental importance. 

In practice and theory, the store-and-forward model of communication is often 
used. In this model, once a node begins transmission of a message unit across a 
link, it continues to transmit until the entire message is sent. Treating messages as 
inviolable packets allows us to ignore some significant issues of control at the cost of 
time. Since time bounds for packet-switched networks are often stated in units of 
packet steps, such bounds must be multiplied by the length of the longest message to 
produce a bound in bit steps. 

Many algorithms have appeared for routing on hypercubes and networks derived 
from hypercubes (such as the butterfly). In 1981, Valiant and Brebner ([VB]) presented
an algorithm for routing $\Omega(\log N)$-bit packets on the $\log N \times N$-node butterfly
(and hence the $N$-node cube) which could route permutations from the top level to
the bottom in $O(\log N)$ packet steps, with high probability.[1] Here a permutation
means that the mapping from origins to destinations is bijective. Their algorithm,
which we will review in section 2.2, introduced the paradigm of routing each packet 
first to a random intermediate destination and then to its true final destination. The 
algorithm routes obliviously: each packet's path is chosen without regard for the 
paths of any other packets. 
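
The random-intermediate-destination paradigm can be sketched on the hypercube as follows. This is a simplified illustration rather than the butterfly formulation reviewed in section 2.2: we assume that each phase corrects the differing bits in increasing dimension order, that node labels are integers with dimension i held in bit i-1, and the function names are ours.

```python
import random


def bit_fixing_path(src: int, dst: int, n: int):
    """Nodes visited when the differing bits are corrected in dimension order."""
    path = [src]
    cur = src
    for i in range(n):
        if ((cur ^ dst) >> i) & 1:  # bit i still differs from the destination
            cur ^= 1 << i
            path.append(cur)
    return path


def two_phase_path(src: int, dst: int, n: int, rng: random.Random):
    """Route src -> random intermediate node -> dst, obliviously.

    The path depends only on src, dst and the packet's own random choice,
    never on the paths of other packets.
    """
    mid = rng.randrange(2 ** n)
    first = bit_fixing_path(src, mid, n)
    second = bit_fixing_path(mid, dst, n)
    return first + second[1:]  # do not repeat the intermediate node
```

Each phase crosses at most $n$ edges, so the sketched path has at most $2n$ edges regardless of the permutation being routed.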

This simple addition of randomness is enough to overcome the proven delays 
involved with deterministic routing algorithms. Borodin and Hopcroft ([BH]) showed 
that any deterministic oblivious algorithm must necessarily take $\Omega(\sqrt{N}/(\log N)^{3/2})$
bit steps, in the worst case, for any $N$-node network.

Since Valiant and Brebner's pioneering work, significant improvements have been 
made. Pippenger ([P]) showed how to route permutations of a fully loaded $\log N \times N$
butterfly in $O(\log N)$ steps with high probability. That is, each node in the butterfly
can generate a packet, not only the nodes in the top level. In Pippenger's algorithm,
only a constant number of packets reside in a queue at any time. Ranade ([Ran]) 
produced an algorithm which routes arbitrary mappings on a fully-loaded butterfly 
using combining, again with constant size queues and in O(log N) packet steps with 
high probability. Both of these algorithms make fundamental use of the paradigm of 
routing to random intermediate destinations. 

2.1.1 Summary of Results 

In this chapter, we consider the problem of packet routing on a hypercube with 
faults. We assume that every node and link of a hypercube fails independently with 
constant probability p. Under this assumption, with probability exponentially close 
to 1, a constant fraction of the components of the cube will fail. In the presence 
of such a large number of faults, we would like to route packets so that any packet 
generated by a working node and sent to a working node arrives safely within the 
stated time bound. 



[1] We use the phrase "$Q$ is less than $O(g)$ with high probability" to mean "For every $k$ there exists
a constant $d$ independent of $N$ such that the probability that $Q$ exceeds $dg$ is less than $N^{-k}$."
We describe and analyze a randomized packet routing algorithm that adaptively
routes packets around faults as they are encountered in an $N$-node hypercube that
contains $\Theta(N)$ randomly located faulty nodes and $\Theta(N \log N)$ randomly located
faulty edges. We assume that each processor can decide if an adjacent node or link
has failed. Also, each processor can choose a random element from a set with as
many as $\log N$ elements according to the uniform distribution. We define the property
of local routability, a characterization of the connectivity of the network after
some components have failed. There exists a constant $c_1$ such that the hypercube
remains locally routable with probability $1 - N^{-c_1}$. We prove that, given that the
hypercube is locally routable, the algorithm routes any permutation on the working 
processors in O(log N) steps with high probability. That is, under the assumption of 
local routability, we reproduce Valiant and Brebner's results in the presence of faulty 
components. Packets which start or end at faulty nodes are eventually determined 
to be undeliverable. All the deliverable packets arrive at their destinations provided 
that they are not located in the immediate vicinity of a processor at the moment 
it fails. The algorithm is fault-tolerant in the sense that no advance knowledge of
the locations of the faults is needed for the path selection, but it is susceptible to 
nodes which fail while holding packets. The algorithm is of interest because during 
most steps, few processors will fail and almost all deliverable packets will be deliv- 
ered. In addition, the algorithm itself is quite simple and is the first adaptive routing 
algorithm for which an O(log N) bound on the routing time has been achieved. 

Work on adaptive routing for faulty hypercubes is potentially applicable outside
the setting of fault tolerance. Except for the algorithm we present, all known
$O(\log N)$ packet step routing algorithms for the hypercube are inherently nonadaptive.
Whereas $\Omega(\log N)$ packet steps are also a lower bound on the time to route
(since the diameter of the hypercube is $\log N$), the implied $O(\log^2 N)$ bit step bound
for $O(\log N)$-size packets is not provably optimal. Recently, we have proven a lower
bound of $\Omega(\log^2 N/\log\log N)$ bit steps for all nonadaptive algorithms ([ALN]). Thus,
serious improvement on the upper bound will have to come from an adaptive algorithm.
There has been other work on packet routing on faulty hypercubes. Most notably,
Rabin ([R]) has devised an elegant scheme called information dispersal routing
wherein each packet to be routed is decomposed into $\log N$ pieces. The pieces are
routed in a randomized nonadaptive fashion to their destinations and then recom- 
bined to form the original message. A key aspect of the scheme is that the packet 
decomposition uses error-correcting codes. Therefore only a constant fraction of the 
pieces of any packet need to get through to the destination for the packet to be 
reconstructed. 

Rabin makes different assumptions about both the nature of fault detection and 
the size of the packets. His model assumes no detection of nearby faults is possible. 
In his algorithm, each node chooses log N node-disjoint paths on which to send its 
pieces without regard for faults they may contain. If a packet encounters a fault, it 
is lost. Rabin's scheme is useful only if the original packets represent relatively long 
bit streams. Because routing information alone uses $\Theta(\log N)$ bits, each of the $\log N$
pieces into which a packet is divided must contain $\Omega(\log N)$ bits. Thus the original
packets must have length $\Omega(\log^2 N)$. Additionally, Rabin's analysis requires the
failure rate $p$ to be $O(1/\log^2 N)$ and allows only edge faults. At most $O(N/\log N)$
edge faults can be absorbed. Under these conditions, the Rabin algorithm provides a
fully fault-tolerant routing of $N$ packets in $O(\log N)$ steps with high probability.

In section 2.3, we show how to achieve Rabin's results with a simpler algorithm 
and analysis. Our analysis permits both node and edge faults and requires p to 
be $O(1/\log N)$ so that the routing can absorb up to $\Theta(N)$ edge faults as well as
$\Theta(N/\log N)$ node faults. (A similar result based on Rabin's original algorithm has
recently been discovered by Giladi ([G]).) We also briefly sketch a way to potentially 
improve its tolerance to faults in as many as a constant fraction of components by 
combining the decomposition scheme with our adaptive algorithm for routing around 
faults. 

All of chapter two represents joint work with Tom Leighton. In addition, lemma 
2.7 is the result of work with Bill Aiello and Satish Rao. 
2.1.2 Overview

Section 2.2 contains the $O(\log N)$-time adaptive routing algorithm. In section 2.3, we
show how to improve Rabin's fault-tolerant results with a simpler algorithm. 

2.2 Fast Routing Around Faults 

In this section we examine the problem of routing a permutation on a faulty hypercube.
We describe a variant of Valiant-Brebner routing on the hypercube that we call
offset routing. The success of the algorithm depends on local routability, a condition
on the connectivity of the nonfaulty processors. We show that with a probability close to 1
a faulty hypercube remains locally routable and that if it does, the routing algorithm 
works with high probability. 

We make several assumptions about the nature of faults and about the abilities 
of the network's processors. Every node and edge of the hypercube is assumed to fail 
independently of other components and with a constant probability $p < 1 - \sqrt{1/2}$.
Every node is able to detect whether a neighboring node or the link to it is faulty by
simply sending a one-bit message and waiting for a response. It does not matter if
the node cannot detect whether the fault lies in the neighbor or the link. We make
minimal assumptions about the messages themselves. Since routing information
uses $\Theta(\log N)$ bits and must accompany each message, we assume that each packet
contains $\Omega(\log N)$ bits.

The idea of the offset routing algorithm is to route around the faulty components. 
Say a hypercube node v holds a message from some source and that the route to 
the destination dictates that the message be sent to its neighbor $v^k$ across the $k$th
dimension. Further assume that the edge $(v, v^k)$ has failed. One way to pass the
message on would be to find a dimension $i \ne k$ for which all components in the path
$(v, v^i, v^{ik}, v^k)$ are nonfaulty. A picture of this path appears in figure 2-1.
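
The search for such a detour can be sketched in Python. The predicates `node_ok` and `edge_ok` are hypothetical stand-ins for the fault detection that each processor is assumed to have, the integer labeling (dimension i in bit i-1) is our own convention, and the sketch simply returns the first suitable dimension.

```python
def find_detour_dimension(v, k, n, node_ok, edge_ok):
    """Return a dimension i != k for which the path (v, v^i, v^ik, v^k)
    is entirely fault-free, or None if no such dimension exists."""

    def flip(u, d):
        # Cross dimension d (1 <= d <= n) from node u.
        return u ^ (1 << (d - 1))

    target = flip(v, k)
    for i in range(1, n + 1):
        if i == k:
            continue
        step1 = flip(v, i)      # v^i
        step2 = flip(step1, k)  # v^ik
        path = [v, step1, step2, target]
        nodes_fine = all(node_ok(u) for u in path)
        edges_fine = all(edge_ok(a, b) for a, b in zip(path, path[1:]))
        if nodes_fine and edges_fine:
            return i
    return None
```

The last hop of the path crosses dimension $i$ again, which is what brings the message from $v^{ik}$ back to $v^k$.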

Figure 2-1: A path of length three avoiding a faulty edge.

Unfortunately, if some node on the path from source to destination has failed
and paths like that shown in figure 2-1 are used exclusively, the message will not
get through. To allow for the existence of node faults, we make sure that once we
have decided on a path from the source to the destination, the message never resides
in any of the processors along the path until it reaches its destination. The path is
treated as a virtual path. Instead of residing in some node $v$ along the virtual path,
the packet will reside in some neighboring node $v^i$. That is, it will be offset by the
dimension $i$. If dimension $k$ is to be traversed, some other offset $j$ will be chosen for
which the entire path $(v^i, v^{ij}, v^{ijk}, v^{jk})$ is fault-free. Thus, instead of residing in node
$v^k$, the packet will be offset by dimension $j$. In this fashion, the offset path skirts
around the virtual path but never meets it until the packet reaches its destination.
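
A single step of this idea might look as follows in Python. This is only a sketch of our reading of the paragraph above, not the algorithm of section 2.2.3: we assume that the new offset $j$ must differ from both $i$ and $k$, that it is drawn uniformly from the offsets whose three-edge path is live, and that the case $i = k$ is handled elsewhere. The predicate `path_is_live` and all names are hypothetical.

```python
import random


def flip(u: int, d: int) -> int:
    """Cross dimension d (1 <= d <= n); dimension d is held in bit d-1."""
    return u ^ (1 << (d - 1))


def offset_step(v, i, k, n, path_is_live, rng):
    """The packet sits at v^i and the virtual path crosses dimension k next.

    Returns (j, path) with path = (v^i, v^ij, v^ijk, v^jk) for a randomly
    chosen live offset j, or None when no offset yields a live path.
    """
    if i == k:
        raise ValueError("this sketch does not cover the case i == k")
    start = flip(v, i)
    candidates = []
    for j in range(1, n + 1):
        if j in (i, k):
            continue  # assumption: j == i or j == k would touch the virtual path
        a = flip(start, j)  # v^ij
        b = flip(a, k)      # v^ijk
        end = flip(b, i)    # v^jk, the new offset position
        path = (start, a, b, end)
        if path_is_live(path):
            candidates.append((j, path))
    if not candidates:
        return None
    return rng.choice(candidates)
```

Neither $v$ nor $v^k$ appears on any candidate path, so the packet never rests on the virtual path itself.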

The offset routing algorithm uses randomness in two different ways. First, ran- 
domness is used to select virtual paths from sources to destinations. The virtual 
paths we will use are precisely the paths chosen by the Valiant-Brebner algorithm. 
Second, the offsets used along the way will be chosen from among those which create 
a live path of length three to the next offset node. 

In section 2.2.1, we define butterflies and we review the Valiant-Brebner routing
algorithm. We prove some important bounds on the number of messages the algorithm
is likely to route through small sets of edges. In section 2.2.2, we define another
network, the butterfly with jump edges, which helps us to think about the offset
routing algorithm on the hypercube. In section 2.2.3, we describe the offset routing
algorithm explicitly. Section 2.2.4 proves a limit of $O(\log N)$ on the length of any
offset path. Finally, section 2.2.5 shows that only $O(\log N)$ other messages use any of
the edges of a particular message's path. This proves that the offset routing algorithm
finishes in $O(\log N)$ routing steps.

2.2.1 Valiant-Brebner Routing 

The virtual paths we will use are those dictated by the Valiant-Brebner routing al- 
gorithm. Since that algorithm is viewed more intuitively as a butterfly algorithm, 
we will present it that way. First, we review some basic butterfly concepts. Next we 
present the Valiant-Brebner routing algorithm. Last, we prove two lemmas about how 
uniformly the algorithm uses edges. These lemmas will be useful when we examine 
the usage of edges by the offset routing algorithm. 

The $\log N \times N$-node or $\log N$-dimensional butterfly is obtained from the $N$-node
hypercube by replacing each node $v$ of the cube by a cycle $(v_0, v_1, \ldots, v_{n-1}, v_0)$, where
$n = \log N$. We replace each edge $(v, v^i)$ by a pair of edges $(v_{i-1}, v^i_i)$ and $(v^i_{i-1}, v_i)$ (mod $n$). We can
visualize the set of nodes $\{v_i \mid v \in H_n\}$ as sharing a level of the butterfly. We call edges
of the form $(v_{i-1}, v_i)$ straight edges and those of the form $(v_{i-1}, v^i_i)$ cross edges. All
edges connect nodes in adjacent levels (mod $n$).




Figure 2-2: A three level butterfly. (The top and bottom rows are identified.) 

All dimension $i$ hypercube edges connect the $(i-1)$st level with the $i$th level. Thus
any hypercube algorithm which only uses one dimension during each step and only 
uses consecutive dimensions during consecutive steps can run on the butterfly just 
as quickly. Any butterfly algorithm works as well on the hypercube from which the 
butterfly was obtained. We may regain this hypercube by collapsing columns of the 
butterfly. 

The Valiant-Brebner hypercube routing algorithm is also a butterfly routing algorithm.
A packet starts at some node $v_0$ and ends at some node $w_0$. We think of the
column of nodes $\{v_i\}$ as being shared by the hypercube node $v$, which assigns each
node in the column a different queue from a set of $n$ queues. If a message traverses
the straight edge $(v_{i-1}, v_i)$ in some butterfly step, then it is passed from the node $v$'s
$(i-1)$st queue to its $i$th queue in the hypercube step. If the message traverses the
cross edge $(v_{i-1}, v^i_i)$ in some butterfly step, then it is passed from $v$'s $(i-1)$st queue
to $v^i$'s $i$th queue in the hypercube step.

Routing from $v_0$ to $w_0$ is simplified by the fact that there is a unique path of
length $n$ between those two nodes. The $i$th step in the path connects a node at level
$i - 1$ with one at level $i$. If $v$ and $w$ agree in the $i$th bit, the edge is a straight edge.
If they differ, a cross edge is used. For example, to route from the node $(1,1,0)_0$ to
the node $(0,1,1)_0$ we would use the path $(1,1,0)_0, (0,1,0)_1, (0,1,0)_2, (0,1,1)_0$.

In the first phase of the Valiant-Brebner routing algorithm, each node in level 0
first sends its packet to a random node in the same level using the unique path of
length $n$. In the second phase, the packet is routed along the unique path to its true
destination. In [VB] it was shown that this algorithm takes $O(n)$ steps to complete
and uses total queue length $O(n)$ at every hypercube node, with high probability.
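
As a small illustration of the unique butterfly path and of the two-phase structure, consider the following Python sketch. It is not taken from [VB]; the representation of node labels as bit tuples, the pairing of each label with its level, and the function names `unique_path` and `two_phase_route` are assumptions made for this example. Queues and timing are not modelled.

```python
import random


def unique_path(v, w):
    """Unique length-n butterfly path from v at level 0 to w at level 0.

    v and w are equal-length tuples of bits. Returns (label, level) pairs.
    """
    n = len(v)
    label = list(v)
    path = [(tuple(label), 0)]
    for i in range(1, n + 1):
        # Step i joins level i-1 to level i (mod n). It is a cross edge when
        # the i-th bits of v and w differ and a straight edge when they agree.
        label[i - 1] = w[i - 1]
        path.append((tuple(label), i % n))
    return path


def two_phase_route(source, dest, rng):
    """Phase one goes to a uniformly random level-0 node, phase two to dest."""
    intermediate = tuple(rng.randrange(2) for _ in source)
    first = unique_path(source, intermediate)
    second = unique_path(intermediate, dest)
    return first + second[1:]  # do not list the intermediate node twice


if __name__ == "__main__":
    for node in unique_path((1, 1, 0), (0, 1, 1)):
        print(node)
    print(two_phase_route((1, 1, 0), (0, 1, 1), random.Random(0)))
```

Calling `unique_path((1, 1, 0), (0, 1, 1))` reproduces the example path above, and every two-phase route has exactly $2n$ edges regardless of the random intermediate node.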

We will worry about congestion, or the total number of messages using a given 
set of edges, in the offset routing schedule. A message can congest an edge only if 
its virtual path brings it close to that edge. It will then congest the edge only if 
particular choices of offset are made. To bound the congestion, we will first bound 
the number of messages whose virtual paths come close to a given set of edges. We 
need only the following two bounds on the number of messages traversing small sets 
of edges via their Valiant-Brebner paths.

Lemma 2.1. Take an arbitrary set of $h$ edges on one level of the $n$-dimensional
butterfly. Then with high probability the Valiant-Brebner routing scheme routes
only $O(h + n)$ messages through edges in the set.

Proof. Note that each message can congest at most one edge in the set. The following 
analysis applies to the first phase of the routing algorithm. The analysis for the second 
phase is almost identical. 

Say the edges share level $l$ of the butterfly. Then we can partition the butterfly's
first $l$ levels into $N/2^l$ nonintersecting butterflies $B_1, B_2, \ldots, B_{N/2^l}$, each built from a
subcube with $2^l$ nodes. For a message to route through one of the $h$ edges, it must
start in the same butterfly as the edge. Say that $h_i$ of the edges lie in butterfly $B_i$.
Because paths are chosen uniformly, each message is equally likely to traverse any of
the edges in a level of $B_i$. Thus each message starting in butterfly $B_i$ has probability
$p_i = h_i/2^l$ that it will hit one of the edges in the set.

For a node $v$, let $X_v = 1$ if $v$'s packet congests an edge in the set and $0$ otherwise.
We wish to bound the value of $X = \sum_v X_v$. To do so, we bound the moment
generating function $M(\lambda) = E[e^{\lambda X}]$ for positive $\lambda$. We can then bound $\Pr[X \ge kh] =
\Pr[e^{\lambda X} \ge e^{\lambda kh}] \le e^{-\lambda kh} E[e^{\lambda X}]$. This bound follows directly from Markov's inequality
$\Pr[Y \ge b] \le E[Y]/b$ for any nonnegative random variable $Y$ and positive bound
$b$. We will first bound the moment generating functions $M_v(\lambda) = E[e^{\lambda X_v}]$. We can
then use the fact that, since the $X_v$ are independent, $M(\lambda) = \prod_v M_v(\lambda)$.

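The moment generating function $M_v(\lambda)$ will depend on the butterfly $B_i$ to which
$v$ belongs. If $v \in B_i$ then $M_v(\lambda) = E[e^{\lambda X_v}] = \frac{h_i}{2^l} e^{\lambda} + 1 - \frac{h_i}{2^l}$. Precisely $2^l$ nodes in
the butterfly share this moment generating function. Thus the moment generating
function $M(\lambda)$ for $X$ satisfies

$$
\begin{aligned}
M(\lambda) &= \prod_{i=1}^{N/2^l} \left( \frac{h_i}{2^l} e^{\lambda} + 1 - \frac{h_i}{2^l} \right)^{2^l} \\
&= \prod_{i=1}^{N/2^l} \left( 1 + \frac{(e^{\lambda} - 1) h_i}{2^l} \right)^{2^l} \\
&\le \prod_{i=1}^{N/2^l} e^{(e^{\lambda} - 1) h_i}.
\end{aligned}
$$

The inequality between lines two and three follows from the inequality $1 + x \le e^x$
for all $x$.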

Thus, since $\sum_i h_i = h$, $\Pr[X \ge kh] \le e^{(e^{\lambda} - 1)h} e^{-kh\lambda} = (e^{e^{\lambda} - k\lambda - 1})^h$. Setting $\lambda = \ln k$, this implies
$\Pr[X \ge kh] \le (e^{k(1 - \ln k) - 1})^h$, a bound which can be made as small as desired by
increasing the constant $k$.

If $h \ge n$, then the probability that more than $kh$ messages pass through the
edges is less than $N^{-\Omega(k \ln k)}$. Similarly, if $h < n$, the chance of having more than $kn$
messages crossing the set is also less than $N^{-\Omega(k \ln k)}$. ■
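
To get a feeling for how quickly the tail bound $(e^{k(1 - \ln k) - 1})^h$ from the proof of lemma 2.1 shrinks as the constant $k$ grows, one can evaluate it numerically. The helper below is a sketch only; the name `tail_bound` is hypothetical, and the restriction to $k > 1$ is an assumption that keeps $\lambda = \ln k$ positive.

```python
import math


def tail_bound(k, h):
    """Evaluate (e^{k(1 - ln k) - 1})^h, the bound on Pr[X >= k*h].

    Assumes k > 1 so that lambda = ln k is positive, as the proof requires.
    """
    if k <= 1:
        raise ValueError("k must exceed 1")
    return math.exp(h * (k * (1.0 - math.log(k)) - 1.0))


if __name__ == "__main__":
    for k in (2, 4, 8, 16):
        print(k, tail_bound(k, h=10))
```

At $k = e$ the exponent is exactly $-h$, and the bound decreases as $k$ increases beyond that.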

Lemma 2.2. Take an arbitrary set of $O(n^3)$ edges in the $n$-dimensional butterfly.
With high probability the Valiant-Brebner routing scheme routes only $O(n^3)$ messages
through edges in the set, counting a message once for each time it traverses an edge
in the set (i.e. counting according to multiplicity).

Proof. We will examine each level separately and then sum across levels. Say level
$l$ has $e_l$ edges from the set. By lemma 2.1, for any $k$ there is a $c$ such that there is at
most an $N^{-k}$ chance that more than $c(e_l + n)$ messages traverse the $e_l$ edges from the
set at level $l$. Summing over all levels, with probability at least $1 - nN^{-k}$, the number
of messages crossing edges from the set at any level is no more than $c(\sum_l e_l + n^2)$,
which is $O(n^3)$ because $\sum_l e_l = O(n^3)$. ■

2.2.2 Jump Edges 

As we mentioned earlier, the second use of randomness involves evading faults which 
lie on the virtual path chosen by the Valiant-Brebner routing algorithm. When we 
route on the hypercube, we have access to many more edges out of each node than 
we do when we route on the butterfly. We can use these edges to route around 
faulty components. While bits are changed consecutively by traversing virtual paths, 
arbitrary bits are changed during fault avoidance. We create the butterfly with jump
edges to accentuate the changing of adjacent bits in the virtual path while allowing
for the changing of arbitrary bits in the offset path.

Figure 2-3: Jump edges. These edges form the hypercube connections for the nodes
on each level. The dashed edges are from the underlying butterfly.

A jump edge is an edge of the type $(v_j, v^i_j)$. Jump edges are not butterfly edges. A
packet traversing such an edge would be sent (in the hypercube) from the $j$th queue
of $v$ across the edge $(v, v^i)$ and deposited in the $j$th queue of $v^i$. Note that all $n$ jump
edges of the type $(v_j, v^i_j)$, $j$ varying, are actually manifestations of a single hypercube
edge from $v$ to $v^i$. This means that every hypercube edge is represented $n + 2$ times
in the butterfly with jump edges: as $n$ different jump edges and 2 cross edges. Figure
2-3 depicts the jump edges for the $3 \times 8$ butterfly.
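
The count of $n + 2$ copies can be checked by listing them explicitly. The sketch below assumes a representation chosen only for this example: a butterfly node is a pair (hypercube label, level), labels are integers, dimensions are numbered $1, \ldots, n$ with dimension $i$ stored in bit $i - 1$, and the function name `copies_of_hypercube_edge` is hypothetical.

```python
def copies_of_hypercube_edge(v, i, n):
    """List the edges of the butterfly with jump edges that represent the
    hypercube edge (v, v^i).

    Nodes are (label, level) pairs with levels 0..n-1. Dimension i (1..n)
    is stored in bit i-1 of the integer label.
    """
    if not 1 <= i <= n:
        raise ValueError("dimension must lie in 1..n")
    w = v ^ (1 << (i - 1))
    # One jump edge per level: the packet changes label but stays on level j.
    jump_edges = [((v, j), (w, j)) for j in range(n)]
    # Two cross edges between level i-1 and level i (mod n), one from each end.
    cross_edges = [((v, i - 1), (w, i % n)), ((w, i - 1), (v, i % n))]
    return jump_edges + cross_edges


if __name__ == "__main__":
    for edge in copies_of_hypercube_edge(0b000, 3, 3):
        print(edge)
```

For the three-level butterfly this yields five edges per hypercube edge: three jump edges and two cross edges.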

If we collapse the levels of the butterfly with jump edges, we regain the hypercube. 
Any algorithm we create for the butterfly with jump edges works as well on the 
hypercube. We need only be especially careful about congestion, or multiple packets 
crossing the same edge. A cross edge or jump edge traversed by a given packet is 
actually one out of several appearances of a hypercube edge in the butterfly with 
jump edges. Any congestion on another manifestation of the hypercube edge could 
slow the packet down. Among other things, we will concern ourselves with the total
congestion on a hypercube edge traversed by a packet, not just the congestion on the 
particular cross edge or jump edge it traverses. 

2.2.3 Offset Routing

In the offset routing algorithm, each packet remains fairly close to its Valiant-Brebner
path. A packet's location always differs from where the Valiant-Brebner algorithm
would send it by some offset which is a random dimension. The offset routing algorithm
retains the two-phase structure of Valiant and Brebner's algorithm.

We first describe how packets are routed from level to level in the butterfly with
jump edges. Recall that the path traversed by a packet in the Valiant-Brebner scheme
is its virtual path. In the offset routing algorithm, a packet whose virtual path would
pass through the $(k-1)$st level at the node $v_{k-1}$ will pass through the level at some
node $v^i_{k-1}$ instead. If its virtual path would leave $v_{k-1}$ via a straight edge, then the
offset path will traverse three edges of the type $(v^i_{k-1}, v^{ij}_{k-1}, v^{ij}_k, v^j_k)$. It finds such a
path by randomly choosing a dimension $j \ne i$ and attempting to route across the
appropriate three edges. If the packet encounters a fault in any of the three edges or
the nodes along those three edges, it returns to the node $v^i_{k-1}$, which chooses another
random dimension and tries again. Note that this means that a packet might have to
traverse many more than three edges to pass from one level to the next. If the virtual
path would leave $v_{k-1}$ via a cross edge, then the offset path traverses three edges of
the type $(v^i_{k-1}, v^{ij}_{k-1}, v^{ijk}_k, v^{jk}_k)$ instead. Note that no matter whether straight edges or
cross edges are used in the virtual path, the packet ends with a random offset $j$ from
its virtual location. If necessary, the $k$th bit is changed to agree with the $k$th bit of
the destination. Figure 2-4 presents an offset path between adjacent levels.
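
The retry loop for a single level transition can be sketched as follows. This is an illustrative model rather than a definitive implementation: nodes are integers, dimensions are bit positions numbered from 0, faults are supplied as a set of nodes and a set of unordered hypercube edges, straight edges are treated as internal to a hypercube node and therefore never faulty, and the new offset $j$ is drawn from dimensions other than $i$ and $k$. All function names and the `max_tries` cap are assumptions of the sketch.

```python
import random


def flip(v, d):
    """Neighbor of hypercube node v across dimension d (bit d)."""
    return v ^ (1 << d)


def try_offset(v, i, j, k, cross, bad_nodes, bad_edges):
    """Attempt the three-edge offset path for new offset j.

    Returns the hypercube node reached on the next level, or None when a
    fault is met. The packet is assumed to sit safely at v^i already.
    """
    a = flip(v, i)                 # current location v^i
    b = flip(a, j)                 # jump across dimension j
    c = flip(b, k) if cross else b  # cross edge flips bit k, straight edge does not
    d = flip(c, i)                 # jump across dimension i removes the old offset
    for x, y in ((a, b), (b, c), (c, d)):
        if y in bad_nodes:
            return None
        if x != y and frozenset((x, y)) in bad_edges:
            return None
    return d


def advance_one_level(v, i, k, cross, n, bad_nodes, bad_edges, rng, max_tries=1000):
    """Pick random offsets until one yields a fault-free path.

    Returns (j, new_location, attempts).
    """
    candidates = [j for j in range(n) if j not in (i, k)]
    for attempt in range(1, max_tries + 1):
        j = rng.choice(candidates)
        reached = try_offset(v, i, j, k, cross, bad_nodes, bad_edges)
        if reached is not None:
            return j, reached, attempt
    raise RuntimeError("no fault-free offset found")
```

With $n = 4$, $v = 0000$, $i = 0$, $k = 1$ and node 0101 faulty, the only usable offset is $j = 3$, so the packet always ends at $v^{jk} = 1010$ however many attempts it needs.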

Each packet must choose an initial offset to leave its source and must remove
its final offset to reach its destination. To begin, the message generated by node
$v_0$ repeatedly chooses a random dimension $j$ and attempts to route across the edge
$(v_0, v^j_0)$ until it successfully finds an initial offset. Say that the message reaches the
0th level at the end of the second phase with offset $i$ (i.e. it reaches the node $w^i_0$).
Then to conclude, the message finds an offset $j$ for which the path $(w^i_0, w^{ij}_0, w^j_0, w_0)$ is
fault-free.

The offset routing algorithm combines Valiant and Brebner's strategy of changing
adjacent bits with a means for avoiding faults. In our analysis, we will make fundamental
use of the property of the distribution of virtual paths proven in lemma
2.2. The even distribution of virtual paths will help to ensure the even distribution
of offset paths over the edges of the hypercube, assuming random offsets are chosen.

Figure 2-4: A virtual edge between adjacent levels (shown as a dashed line) and a
possible offset path (shaded). In this example, $i = 1$ and $j = 3$.

2.2.4 The Length of Offset Paths 

If a packet is to arrive at its destination within $O(\log N)$ steps, certainly the path
it takes must have length $O(\log N)$. In Valiant and Brebner's algorithm, the length
of paths is fixed at $2 \log N$. Offset paths are of variable length, depending on faults
encountered along the way. In this section we describe the condition of local routability.
We prove that if the hypercube is locally routable, then with high probability all
packets traverse offset paths of length $O(\log N)$.

Essentially, a hypercube is locally routable if every node always has ample opportunity
to send a packet to the next level in the butterfly with jump edges. Consider
a path $(v^i_{k-1}, v^{ij}_{k-1}, v^{ijk}_k, v^{jk}_k)$. We assume that a message has successfully arrived at
$v^i_{k-1}$ and so there are six components (three nodes and three edges) in the path that
must all work properly. If the probability of failure is $p < 1 - \sqrt[6]{1/2}$ (about 0.11) and
the faults are independent, then each such path has probability $p' = 1 - (1 - p)^6 < \frac{1}{2}$
that it has a faulty component. For subsequent analysis, we would like it to be the
case that for all pairs $v_{k-1}, i$ there are at least a constant fraction of offset dimensions
$j$ which lead the message on a functioning path $(v^i_{k-1}, v^{ij}_{k-1}, v^{ijk}_k, v^{jk}_k)$. We would
also like to know that at least a constant fraction of the paths $(v^i_{k-1}, v^{ij}_{k-1}, v^{ij}_k, v^j_k)$
are fault-free for all pairs $v_{k-1}, i$. To begin the routing, we need that for all nodes
$v_0$, a constant fraction of the edges $(v_0, v^j_0)$ function properly. To end the routing,
we need that for all pairs $v_0, i$, a constant fraction of offset dimensions $j$ lead the
message on a functioning path $(v^i_0, v^{ij}_0, v^j_0, v_0)$. Define the following sets of paths:

$P_{v_{k-1},i} = \{(v^i_{k-1}, v^{ij}_{k-1}, v^{ijk}_k, v^{jk}_k),\ j \text{ varying}\}$,
$P'_{v_{k-1},i} = \{(v^i_{k-1}, v^{ij}_{k-1}, v^{ij}_k, v^j_k),\ j \text{ varying}\}$,
$Q_{v_0} = \{(v_0, v^j_0),\ j \text{ varying}\}$ and
$Q'_{v_0,i} = \{(v^i_0, v^{ij}_0, v^j_0, v_0),\ j \text{ varying}\}$.

Fix an $\epsilon_p > 0$. If each of the sets $P_{v_{k-1},i}$, $P'_{v_{k-1},i}$, $Q_{v_0}$ and $Q'_{v_0,i}$
contains at least $\epsilon_p n$ fault-free paths, we say the butterfly is locally routable.
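
The numerical threshold quoted above is easy to verify. The short sketch below computes $1 - \sqrt[6]{1/2}$ and the per-path fault probability $p' = 1 - (1 - p)^6$; the constant and function names are hypothetical, and the last helper simply restates the cardinality condition for one set of paths.

```python
# Largest component failure probability for which a six-component path
# still has probability below 1/2 of containing a fault.
FAILURE_THRESHOLD = 1.0 - 0.5 ** (1.0 / 6.0)


def path_fault_probability(p, components=6):
    """Probability that at least one of the independent components fails."""
    return 1.0 - (1.0 - p) ** components


def has_enough_live_paths(live_path_count, n, eps_p):
    """Condition for one set of paths: at least eps_p * n are fault-free."""
    return live_path_count >= eps_p * n


if __name__ == "__main__":
    print(FAILURE_THRESHOLD)
    for p in (0.05, 0.10, 0.12):
        print(p, path_fault_probability(p))
```

The threshold evaluates to roughly 0.109, so $p = 0.10$ keeps $p'$ below one half while $p = 0.12$ does not.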

Lemma 2.3. Assume that the probability that any component fails is less than
$1 - \sqrt[6]{1/2}$ and that all failures occur independently. Then there exist a sufficiently small
$\epsilon_p > 0$ and $c_1 = c_1(\epsilon_p, p) > 0$ such that with probability $1 - N^{-c_1}$ the butterfly is locally
routable.

Proof. The paths in the set $P_{v_{k-1},i}$ available at $v^i_{k-1}$ are node-disjoint. (The same
argument holds for sets of paths $P'$, $Q$ and $Q'$.) Thus the faultiness of any path is
independent of other paths in the set.

The probability that fewer than $\epsilon_p n$ paths in $P_{v_{k-1},i}$ are fault-free is

$$
\sum_{m=0}^{\epsilon_p n} \binom{n}{m} (1 - p')^m (p')^{n-m}.
$$

The ratio of successive terms is $\frac{t_{m+1}}{t_m} = \frac{(n-m)(1-p')}{(m+1)p'}$, which is greater than and bounded
away from 1 for small enough $\epsilon_p$. Thus the sum is bounded by a constant times the
last term. Let $\exp_2(x)$ denote $2^x$. Then the last term $t_{\epsilon_p n}$ satisfies

$$
\begin{aligned}
t_{\epsilon_p n} &= \binom{n}{\epsilon_p n} (1 - p')^{\epsilon_p n} (p')^{(1 - \epsilon_p) n} \\
&\le \exp_2(\epsilon_p n \log e - \epsilon_p n \log \epsilon_p + \epsilon_p n \log(1 - p') - \epsilon_p n \log p' + n \log p') \\
&= \exp_2(h(\epsilon_p, p')\, n + n \log p').
\end{aligned}
$$

We can bound this last expression using the fact that each path has a low probability
of containing a failure. Since $p' < \frac{1}{2}$, $\log p' = -1 - c_2$ for some $c_2 > 0$. For
$\epsilon_p = 0$, $h(\epsilon_p, p') = 0$ and the above expression equals $N^{-1-c_2}$. Since $h(\epsilon_p, p')$ is continuous,
there is an $\epsilon_p > 0$ such that the above expression is bounded by $N^{-1-c_3}$ with
$c_3 > 0$. Since there are only $O(N \log^2 N)$ pairs $v_{k-1}, i$, any $c_1 < c_3$ makes the lemma
true. ■

With probability $N^{-c_4}$ for some fixed $c_4$, some node has only faulty neighbors.
Thus we cannot strengthen lemma 2.3. For the remainder of section 2.2, we assume 
the butterfly is locally routable. Under this assumption, we will prove that the 
algorithm succeeds quickly with high probability. 

Lemma 2.4. Say a butterfly has faulty components but is locally routable. With
high probability each message in the offset routing traverses a path of length $O(n)$.

Proof. We will prove that any given message's path is of length $O(n)$ with high
probability. Since there are only $N$ messages, this will imply the lemma. Assume
that at some point in its route, the packet is at the node $v^i_{k-1}$ where $v_{k-1}$ is the node
it would traverse in the Valiant-Brebner scheme. Assume as well that the packet is
scheduled to traverse dimension $k$. (If the straight edge is to be used or if the packet
is at the beginning or end of the route, the analysis is identical.) Then if the packet
successfully chooses to jump across dimension $j$, the path $(v^i_{k-1}, v^{ij}_{k-1}, v^{ijk}_k, v^{jk}_k)$ must
have no faults. Since the butterfly is locally routable, at least $\epsilon_p n$ of the possible paths to
choose are fault-free. If a faulty path is chosen, the packet encounters the fault and
returns to $v^i_{k-1}$ using no more than six edges. Since a random dimension is chosen at
each step, the probability that a packet takes more than $6b(2n + 2)$ steps is less than
the probability of at least $(b - 1)(2n + 2)$ heads in a sequence of $b(2n + 2)$ tosses of a
coin with probability $\epsilon_p$ of landing tails. This probability is less than

$$
\binom{b(2n+2)}{2n+2} (1 - \epsilon_p)^{(b-1)(2n+2)}
\le \left( \frac{e\, b(2n+2)}{2n+2} \right)^{2n+2} (1 - \epsilon_p)^{(b-1)(2n+2)},
$$

an inverse polynomial in $N$ for large enough $b$. ■

Figure 2-5: One packet might delay another packet's progress even if they never cross
paths in the butterfly with jump edges. One hypercube edge is replicated $n + 2$ times
in the butterfly with jump edges. The two sections of paths shown here intersect in
the hypercube because the darkened edges are actually one edge in the cube.
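
The coin-tossing bound in the proof of lemma 2.4 can be evaluated directly for small parameters. The following sketch compares the binomial form with the relaxed form obtained from the standard estimate $\binom{m b}{m} \le (eb)^m$; the function names are hypothetical, and the parameter values in the demonstration are arbitrary choices, not values from the analysis.

```python
import math


def heads_bound(b, n, eps_p):
    """Binomial form: C(b*m, m) * (1 - eps_p)^((b-1)*m) with m = 2n + 2.

    Intended for small b and n only, since the binomial coefficient is
    converted to a float.
    """
    m = 2 * n + 2
    return math.comb(b * m, m) * (1.0 - eps_p) ** ((b - 1) * m)


def relaxed_heads_bound(b, n, eps_p):
    """Relaxed form: (e * b * (1 - eps_p)^(b-1))^(2n + 2)."""
    m = 2 * n + 2
    return (math.e * b * (1.0 - eps_p) ** (b - 1)) ** m


if __name__ == "__main__":
    for b in (4, 10, 20):
        print(b, heads_bound(b, 5, 0.5), relaxed_heads_bound(b, 5, 0.5))
```

Once $e\,b(1 - \epsilon_p)^{b-1}$ drops below 1, the relaxed bound decays exponentially in $n$, which is an inverse polynomial in $N = 2^n$.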



2.2.5 Delay From Other Packets 

Now that we know each message moves a distance of $O(\log N)$ during an offset routing
phase, we need to show that its forward movement is delayed by at most O(log N) 
other packets. These facts together will bound the packet's time to its destination. 
We will show that few other packets choose virtual paths in such a way that they 
have a non-zero probability of selecting an offset path which congests a given node's 
path. We will then show that even fewer of those actually congest the path when 
they use offset paths. 

Recall that a cross edge or jump edge traversed by a given packet is actually one 
out of several appearances of a hypercube edge in the butterfly with jump edges. Any 
congestion on another manifestation of the hypercube edge will slow the packet down. 
Therefore we group all n + 2 copies of the edge together and refer to the group as one 
hypercube edge. 

Lemma 2.5. Consider a set $E$ of $O(n)$ hypercube edges and butterfly straight edges.
Let $S$ be the set of butterfly edges such that any packet whose virtual path crosses
an edge in $S$ has a non-zero probability of congesting an edge in $E$ as a butterfly edge
in its offset path. Then with high probability, there are $O(n^3)$ packets whose virtual
paths traverse any of the edges in $S$, counting a packet several times if it traverses
several edges in $S$.

Proof. If $(w_{k-1}, w^k_k)$ is a butterfly edge traversed by a packet's offset path then the
packet's virtual path must use an edge of the form $(w^{ij}_{k-1}, w^{ijk}_k)$ for some pair $i, j$. There
are only $n^2$ such pairs. The same reasoning would hold if the edge in question were
a straight edge. Since $|E| = O(n)$, $|S| = O(n^3)$. By lemma 2.2, only $O(n^3)$ packets
traverse edges in $S$, with high probability. ■

Lemma 2.6. Let $T$ be the set of butterfly edges such that any packet whose virtual
path crosses an edge in $T$ has a non-zero probability of congesting some edge in
$E$ as a jump edge in its offset path. Then with high probability, there are $O(n^3)$
packets whose virtual paths traverse any of the edges in $T$, again counting according
to multiplicity.

Proof. Say $(w_{k-1}, w^l_{k-1})$ is a jump edge traversed by a packet. Let $(v^i_{k-1}, v^{ij}_{k-1}, v^{ijk}_k, v^{jk}_k)$
or $(v^i_{k-1}, v^{ij}_{k-1}, v^{ij}_k, v^j_k)$ be the path used by the packet when it traverses the jump edge.
Then $(w, w^l)$ is either the first or the last edge traversed in the path. If it is the first,
then $w_{k-1} = v^i_{k-1}$, $w^l_{k-1} = v^{ij}_{k-1}$ and therefore $l = j$. The edge traversed in the virtual
path would have been $(v_{k-1}, v^k_k)$ or $(v_{k-1}, v_k)$ for some $k$. There are $n$ choices for $v$
such that $v^i_{k-1} = w_{k-1}$ and $n$ choices for $k$. Thus there are only $O(n^2)$ elements of $T$
whose traversal in some packet's virtual path gives the packet a non-zero probability
of traversing the edge $(w, w^l)$ as a jump edge. The same reasoning holds for use of
the jump edge as a third edge. Again, since $|E| = O(n)$, $|T| = O(n^3)$. By lemma 2.2,
only $O(n^3)$ packets traverse edges in $T$, with high probability. ■

Lemmas 2.5 and 2.6 also hold for the set of edges incident to the set of nodes
$\{v_k\}$ for some hypercube node $v$. If we bound the number of packets congesting these
edges then we bound the number of packets ever residing in queues in the node $v$ (the
queue size of $v$).

We would like to bound the number of packets which congest the path $p_0$ of
some message $m_0$. There are $O(n^3)$ packets with a nonzero probability of congesting
some edge along $p_0$. Focus on one of these packets $m_r$. The packet $m_r$ will cause
congestion along the path $p_0$ only if an unfortunate pair of offsets $i$ and $j$ are chosen
for it. When the packet traverses from level $l_r$ to the level $l_r + 1$, its offset $i$ at the $l_r$th
level is inherited from a choice made by some node in the $(l_r - 1)$st level. Then the
node at level $l_r$ chooses an offset $j$ which will route the packet $m_r$ to the $(l_r + 1)$st level.
Because any given fault can affect the routes of several different packets, offset choices
made by different packets are dependent. However, the packet $m_r$ is guaranteed by
local routability to have $\sigma(r) \ge \epsilon_p n$ choices of offset $i$ available at level $l_r - 1$. One
of those offsets, say $i_s$, will be chosen uniformly to route the packet $m_r$ to level $l_r$.
Once there, some number $\alpha_{rs}$ of the offsets $j$ will cause the packet $m_r$ to cross the
path $p_0$, if the packet is routed to level $l_r + 1$ using one of those $\alpha_{rs}$ offsets. Since
we wish to minimize the probability that such congestion occurs too often, we are
concerned that choices for $i_s$ are made which leave too many unfortunate choices of
$j$. By lemmas 2.5 and 2.6, we know that there exists a constant $d$ such that with high
probability $\sum_{r,s} \alpha_{rs} \le dn^3$. This follows because summing the $\alpha_{rs}$ is a second way to
count virtual paths traversing edges in $S$ and $T$, counting a path once for each time
it traverses an edge in $S$ or $T$. Finally, each $\alpha_{rs}$ is at most $n$, since there will be a
total of $n$ offsets $j$ from which to choose. In the next technical lemma, we use these
bounds on the $\alpha_{rs}$, or number of unfortunate choices of offset $j$, to bound the number
of bad choices of $j$'s left once all packets have had the offsets $i$ chosen for them.

Lemma 2.7. Consider a family of nonnegative integers $\{\alpha_{rs} \mid 1 \le r \le z, 1 \le s \le
\sigma(r)\}$ where $\sigma(r) \ge \epsilon_p n$ for all $r$, $\sum_{r,s} \alpha_{rs} \le dn^3$ and $\alpha_{rs} \le n$ for all pairs $r, s$. If
exactly one index $s_r$ is chosen uniformly in $[1, \sigma(r)]$ for each index $r$ then with high
probability $\sum_r \alpha_{r s_r} = O(n^2)$.

Proof. Let $X_r = \alpha_{r s_r}$. A picture of the choice of the $X_r$ appears in figure 2-6.

We wish to bound the value of $X = \sum_r X_r$. As in lemma 2.1, we bound the
moment generating function $M(\lambda) = E[e^{\lambda X}]$ and then we bound $\Pr[X \ge bn^2] =
\Pr[e^{\lambda X} \ge e^{\lambda bn^2}] \le e^{-\lambda bn^2} E[e^{\lambda X}]$. As before, we will first bound the moment generating
functions $M_r(\lambda) = E[e^{\lambda X_r}] = \frac{1}{\sigma(r)} \sum_{s=1}^{\sigma(r)} e^{\lambda \alpha_{rs}}$. Again, since the $X_r$ are independent,
$M(\lambda) = \prod_r M_r(\lambda)$.

Figure 2-6: The $\alpha_{rs}$ and a possible choice of $X_r$. Each row represents the choice
for some packet. The entry $\alpha_{rs}$ counts the number of offsets $j$ which, if chosen in
conjunction with the offset $i_s$, would cause the packet $m_r$ to congest the path $p_0$.
Circled entries represent the selections from each row.

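If we could find $\alpha_{rx} \le \alpha_{ry}$ and a positive integer $\delta$ such that $0 \le \alpha_{rx} - \delta$ and $\alpha_{ry} + \delta \le n$,
then by transferring $\delta$ units from the smaller $\alpha_{rx}$ to the larger $\alpha_{ry}$ we could only increase
$M_r(\lambda)$ (for positive $\lambda$). This follows because

$$
e^{\lambda \alpha_{rx}} - e^{\lambda(\alpha_{rx} - \delta)} = e^{\lambda(\alpha_{rx} - \delta)}(e^{\lambda\delta} - 1) < e^{\lambda \alpha_{ry}}(e^{\lambda\delta} - 1) = e^{\lambda(\alpha_{ry} + \delta)} - e^{\lambda \alpha_{ry}}.
$$

The resultant change in $M_r(\lambda)$, which is proportional to
$(e^{\lambda(\alpha_{ry} + \delta)} - e^{\lambda \alpha_{ry}}) - (e^{\lambda \alpha_{rx}} - e^{\lambda(\alpha_{rx} - \delta)})$, would be strictly positive. By this reasoning, if $A_r = \sum_s \alpha_{rs}$ is
fixed, we maximize $M_r(\lambda)$ by setting all terms except possibly one equal to either
$0$ or $n$. Thus

$$
E[e^{\lambda X_r}] \le
\begin{cases}
\frac{1}{\sigma(r)} \left( e^{\lambda A_r} + \sigma(r) - 1 \right) & \text{if } A_r < n, \\
\frac{1}{\sigma(r)} \left( \lceil A_r/n \rceil e^{\lambda n} + \sigma(r) - \lceil A_r/n \rceil \right) & \text{if } A_r \ge n.
\end{cases}
$$

For the rest of the proof we fix $\lambda = \frac{1}{n}$. If $A_r < n$ then
$M_r(\frac{1}{n}) \le \frac{1}{\sigma(r)}(e^{A_r/n} + \sigma(r) - 1) \le \frac{1}{\sigma(r)}(1 + \frac{2A_r}{n} + \sigma(r) - 1) \le 1 + \frac{2A_r}{\epsilon_p n^2}$.
(The second inequality uses the fact that for $0 \le \gamma \le 1$, $e^{\gamma} \le 1 + 2\gamma$.) If $A_r \ge n$ then,
since $\lceil A_r/n \rceil \le 2A_r/n$, $M_r(\frac{1}{n}) \le \frac{1}{\sigma(r)}(\frac{2A_r}{n} e + \sigma(r)) \le 1 + \frac{2eA_r}{\epsilon_p n^2}$.
In either case the bound is at most $1 + \frac{2eA_r}{\epsilon_p n^2} \le \exp(\frac{2eA_r}{\epsilon_p n^2})$. Thus, because
$\sum_r A_r \le dn^3$, $M(\frac{1}{n}) \le \prod_r \exp(\frac{2eA_r}{\epsilon_p n^2}) \le e^{2edn/\epsilon_p} = N^{O(d/\epsilon_p)}$.
Continuing the reasoning of the first paragraph of the proof, $\Pr[X \ge bn^2]
\le e^{-bn} N^{O(d/\epsilon_p)}$. We can make this probability an arbitrarily large negative power of $N$
by letting $b$ be a large constant. ■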
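
The transfer argument in the proof of lemma 2.7 can be checked on small rows. In the sketch below a row is a list of the values $\alpha_{rs}$ for one fixed $r$; the helper names `row_mgf`, `transfer` and `extremal_row` are hypothetical and serve only to illustrate that moving mass from a smaller entry to a larger one increases the moment generating function.

```python
import math


def row_mgf(alphas, lam):
    """M_r(lam) when the index s_r is uniform over the entries of the row."""
    return sum(math.exp(lam * a) for a in alphas) / len(alphas)


def transfer(alphas, x, y, delta, n):
    """Move delta units from the smaller entry x to the larger entry y."""
    if not (alphas[x] <= alphas[y]
            and alphas[x] - delta >= 0
            and alphas[y] + delta <= n):
        raise ValueError("transfer must keep every entry in [0, n]")
    moved = list(alphas)
    moved[x] -= delta
    moved[y] += delta
    return moved


def extremal_row(total, n, width):
    """Row with the given sum in which all entries but one equal 0 or n."""
    full, rest = divmod(total, n)
    row = [n] * full + ([rest] if rest else [])
    if len(row) > width:
        raise ValueError("total is too large for this width")
    return row + [0] * (width - len(row))


if __name__ == "__main__":
    before = [1, 2, 3]
    after = transfer(before, 0, 2, 1, n=4)
    print(row_mgf(before, 0.25), row_mgf(after, 0.25))
    print(extremal_row(6, 4, 3))
```

Repeating such transfers until none is possible leads to the extremal row, which is why the piecewise bound on $E[e^{\lambda X_r}]$ depends only on the row sum $A_r$.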

Each packet $m_r$ may take several attempts before reaching level $l_r + 1$ safely. On
each attempt, the packet $m_r$ may congest the path $p_0$. The packet always has at
least an $\epsilon_p$ chance that it will make it to level $l_r + 1$ on any given attempt. Thus the
number of trials it requires to succeed will be distributed somewhat like the geometric
distribution with parameter $\epsilon_p$. For ease of notation, set $\beta_r = \alpha_{r s_r}$ from the previous
lemma. In each attempt by the packet $m_r$, there are $\beta_r$ choices of offset which will
produce congestion. The following technical lemma will help bound these multiple
contributions.

Lemma 2.8. Consider a family of nonnegative integers $\{\beta_r \mid 1 \le r \le z\}$ where
$\sum_r \beta_r = O(n^2)$ and $\beta_r = O(n)$ for all $r$. Let $\{g_r\}$ be a set of random variables
with geometric distributions $g_r \sim G(\epsilon_p)$ (i.e. $g_r = a$ with probability $\epsilon_p (1 - \epsilon_p)^{a-1}$).
Then with high probability, $\sum_r g_r \beta_r = O(n^2)$.

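Proof. Order the integers by increasing size $\beta_1 \le \beta_2 \le \ldots \le \beta_z$. Then since

$$
\beta_{kn+1}, \beta_{kn+2}, \ldots, \beta_{(k+1)n-1}
$$

are all at least as large as $\beta_{kn}$, we know that $\sum_{k=1}^{z/n-1} \beta_{kn} = O(n)$. We assume that
$\beta_z = O(n)$, so the sum $\beta_z + \sum_{k=1}^{z/n-1} \beta_{kn} = O(n)$.

Now with high probability, all sums $\sum_{r=1}^{n} g_{kn+r}$ are $O(n)$. We know that

$$
\sum_r g_r \beta_r \le \sum_k \left( \sum_{r=1}^{n} g_{kn+r} \right) \beta_{(k+1)n}.
$$

Thus, with high probability, $\sum_r g_r \beta_r = O(n^2)$. ■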
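
The block inequality at the heart of the proof of lemma 2.8 holds for every outcome of the $g_r$, which the following sketch illustrates on random data. The code uses 0-based indices, assumes for simplicity that $z$ is a multiple of $n$, and draws the $g_r$ independently; these choices, the seed and the function names are assumptions of the example.

```python
import random


def geometric(eps_p, rng):
    """Sample a with probability eps_p * (1 - eps_p)^(a - 1), a = 1, 2, ..."""
    a = 1
    while rng.random() >= eps_p:
        a += 1
    return a


def block_bound(betas, gs, n):
    """Return (sum_r g_r * beta_r, sum over blocks of (block sum of g) * largest beta).

    betas must be sorted in increasing order and have length divisible by n.
    """
    if len(betas) % n != 0 or list(betas) != sorted(betas):
        raise ValueError("betas must be sorted with length a multiple of n")
    lhs = sum(g * b for g, b in zip(gs, betas))
    rhs = 0
    for start in range(0, len(betas), n):
        # The last entry of each block is its largest beta.
        rhs += sum(gs[start:start + n]) * betas[start + n - 1]
    return lhs, rhs


if __name__ == "__main__":
    rng = random.Random(7)
    n = 8
    betas = sorted(rng.randrange(n + 1) for _ in range(4 * n))
    gs = [geometric(0.25, rng) for _ in betas]
    print(block_bound(betas, gs, n))
```

Because each $\beta_r$ is replaced by the largest value in its block, the right-hand side can only be larger, and it factors into block sums of geometric variables that concentrate around $n/\epsilon_p$.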



Theorem 2.9. If we route using offset routing and the hypercube is locally routable,
then with high probability, all packets are delivered in $O(\log N)$ steps and all nodes
have total queue size $O(\log N)$.

Proof. Focus on the path $p_0$ of a particular message $m_0$. We will show that the
congestion along $p_0$ from various sources is $O(n)$ with high probability.

Lemmas 2.5 and 2.6 bound the number of messages which have the potential to
congest an edge of $m_0$'s path while passing between levels on their own paths. Enumerate
the packets $m_1, m_2, \ldots, m_z$ which have a non-zero probability of congesting
$p_0$ while traversing an edge from an even level to an odd level in their virtual paths.
A particular packet may appear several times in the enumeration, once for each even
level node along its virtual path from which it might congest an edge of $p_0$.

The packet $m_r$ has at least $\epsilon_p n$ paths which would successfully route it to the next
level. Arbitrarily designate exactly $\epsilon_p n$ of these paths as special. For the purposes
of our analysis, we require $m_r$ to choose a special path before we allow it to route to
the next level. This can only increase the amount of congestion placed on any edge,
since it increases the number of attempts made by each packet. However, once $m_r$
does choose a special path, we always place it in the last node of the first fault-free
path it found. Thus $m_r$ winds up in the same place on the next level as if no special
requirements had been made.

Consider the choice of offsets made by the message $m_r$ at even level $l_r$. Let $q_r$ be
the number of choices of pairs of offset dimensions $(i, j)$ for the message $m_r$ which
would congest an edge in $m_0$'s path. Then $\sum_r q_r = O(n^3)$ by lemmas 2.5 and 2.6. (As
described in the discussion immediately preceding lemma 2.7, $\sum_r q_r$ is a second way
to count the number of edges in $S$ and $T$ according to multiplicity.)

The choice of the dimension $i$ was actually made for $m_r$ at level $l_r - 1$. The choice
was made randomly and uniformly from the set of offsets which led to a fault-free
path to level $l_r$. The exact selections of offsets $i$ are dependent from packet to packet
and, for a particular packet, from one level to the next. However, no matter how
we condition on previous events, there are always enough offsets to choose from at
any given moment. Also, the bounds on the probabilities of congesting $p_0$ will hold
regardless of previous events. Let $i_1 < i_2 < \ldots < i_{\sigma(r)}$, $\sigma(r) \ge \epsilon_p n$, be the choices
of offsets at level $l_r - 1$ which lead to a fault-free path to level $l_r$. Let $\alpha_{rs}$ equal the
number of offsets $j$ such that if $m_r$ is routed from level $l_r - 1$ to level $l_r$ using offset
$i_s$ and next to level $l_r + 1$ using offset $j$, then congestion results in $m_0$'s path. Then
since $\sum_s \alpha_{rs} = q_r$, $\sum_{r,s} \alpha_{rs} = O(n^3)$. Since the total number of offsets $j$ is $n$, clearly
$\alpha_{rs} \le n$. Let $i_{s_r}$ be the offset for $m_r$ actually chosen at level $l_r - 1$. Lemma 2.7
implies that with high probability $\sum_r \alpha_{r s_r} = O(n^2)$. For convenience of notation, set
$\beta_r = \alpha_{r s_r}$.

At level $l_r$, whether the message $m_r$ chooses a path from the set of $\epsilon_p n$ special
paths or the set of $(1 - \epsilon_p)n$ nonspecial paths, it has at most $\beta_r$ choices which congest
$m_0$'s path. Thus, whether we condition on the choice being special or nonspecial,
the probability that message $m_r$ will congest $m_0$'s path is bounded by $\frac{\beta_r}{\epsilon_p n}$.

Now that we have bounded the probability that the packet $m_r$ will congest the
path $p_0$ during one of its attempts to route to level $l_r + 1$, we can bound the probability
that too many packets actually congest $p_0$. The number of routing attempts made by
$m_r$ is $g_r \sim G(\epsilon_p)$. On each attempt, the probability that $m_r$ will congest $m_0$'s path
is at most $\frac{\beta_r}{\epsilon_p n}$. Each attempt is an independent trial and the sum of the probabilities
of congestion in the trials is at most $\frac{1}{\epsilon_p n} \sum_r g_r \beta_r$, which is $O(n)$ by lemma 2.8. By a
moment generating function argument identical to that in lemmas 2.1 and 2.7, with
high probability $O(n)$ attempts actually did congest $m_0$'s path. Since each attempt
involves at most six edges, each attempt can add at most six to the congestion on
$m_0$'s path. Thus with high probability, the total congestion on the path from routing
attempts at even levels is $O(n)$.

Next examine the congestion on $p_0$ from other packets beginning and ending their
paths. For a packet to congest an edge as the first jump edge of its path, it has
to be generated by one of the edge's endpoints. Thus there are at most $O(n)$ such
packets. Now consider those packets congesting $p_0$ during the ending of their paths.
Each of the three jump edges used to finish off a path has an endpoint which is at
distance one from the virtual destination. Thus at most $O(n)$ packets exist which
have the potential to congest any given edge as the first, second or third of these jump
edges. Therefore a total of $O(n^2)$ packets have a non-zero probability of congesting
some edge of $p_0$ as they finish their routes. An argument along the lines of the one
bounding congestion at even levels shows that congestion from these sources is $O(n)$
as well.

The same argument bounds congestion from routing attempts at odd levels, and 
also bounds congestion on edges incident to any fixed node. I 

2.3 Information Dispersal Routing 

The offset routing algorithm cannot tolerate faults which occur during a particular 
routing phase. If a packet resides in a node as it fails, that packet is irretrievably lost. 
Rabin ([Rab]) discovered how to use the technique of information dispersal to route 
even in the presence of failing nodes, provided each fault occurs with probability no
more than $O(1/n^2)$.

In this section we will present a simpler variation of Rabin's algorithm. We also 
show how our algorithm handles faults occurring with probability $O(1/n)$. First, we
will briefly sketch the main ideas of the original routing algorithm. Each packet is 
dispersed into n pieces sent along node-disjoint paths to different locations and then 
along node-disjoint paths to the final destination. 

Since every piece needs to carry $\Omega(n)$ bits of routing information, the original
packets must necessarily be large. For concreteness we assume that all packets contain
$L = \Theta(n^2)$ bits. Any piece created will contain $O(L/n)$ bits. We also assume that all
links and nodes have the capacity to hold a constant number of the original packets
(and therefore $O(n)$ pieces).

Rabin proves that with high probability, the number of pieces crossing any node 
or link never exceeds its capacity. No piece's progress is ever delayed by a full queue 
in the node ahead. This guarantees that each piece can move during every step and 
that the entire routing will take no more than 2(n + 1) steps — n + 1 steps for each 
piece to arrive at its random intermediate location and another n + 1 to arrive at its 
final destination. 

As Rabin points out, routing with dispersal of information can tolerate faults if 
the dispersal into pieces is done with more redundancy. The pieces may actually be 
constructed in such a way that the arrival of half (or some other constant fraction) 
of them is enough to reconstruct the original message. Rabin shows how to do this
through matrix multiplication. He then proves that if each link has probability $1/n^2$ of
failure, then with probability $1 - 2N(4e/n)^{n/4}$ all messages will be safely reconstructed
at their destinations.

2.3.1 Routing Along Parallel Paths 

Our improvement of Rabin's results stems from a more uniform and efficient selection 
of paths for the routing of pieces. The n pieces are first sent to the neighbors
of the node which generated the packet. These pieces are then routed along parallel 
paths to the neighbors of a random intermediate node. From there the pieces are 
routed along parallel paths to the neighbors of the intended destination, and from 
there to the destination itself. Except for the dispersal of the pieces to the neighbors 
of the source and the recovery of the pieces from the neighbors of the destination, the 
algorithm can be viewed as a butterfly algorithm. We will use the butterfly for our 
analysis. A picture of a set of parallel paths appears in figure 2-7. 

If $v$ and $w$ are two hypercube nodes, let $\pi_i(v, w)$ be the path from $v^i$ to $w^i$ used in
one phase of the Valiant-Brebner scheme. Let $\Pi(v, w) = \{\pi_i(v, w) \mid 1 \le i \le n\}$ be the
set of all possible such paths. We will first show that if each node $v$ chooses a node
$v'$ uniformly and then routes a different piece along each of the $n$ paths in $\Pi(v, v')$,
then only $O(n)$ pieces reside in any node's queue at any time step.

Lemma 2.10. Consider the collection of all paths in the $N$ sets $\Pi(v, v')$ (varying
over $v$), where each hypercube node $v$ has chosen a node $v'$ randomly and uniformly.
For any node $u$ and any integer $0 \le j \le n$, with high probability $u$ is the $j$th node
along only $O(n)$ paths in the collection.

Proof. If $u$ is the $j$th node along the path $\pi_i(v, w)$ then $u = w^i_1 w^i_2 \cdots w^i_j v^i_{j+1} \cdots v^i_n$.
Separate the two cases in which either $i \le j$ or $i > j$. If $i \le j$, then it must be
that $v_{j+1} \cdots v_n = u_{j+1} \cdots u_n$. Precisely $2^j$ nodes satisfy this condition for $v$. If one of
these nodes chooses a $w$ such that $w_1 \cdots w_{i-1} \bar{w}_i w_{i+1} \cdots w_j = u_1 \cdots u_j$ for some $i \le j$,
then $u$ will be the $j$th node along exactly one path $\pi_i(v, w)$. Otherwise, $u$ will be the
$j$th node along none of the paths $\pi_i(v, w)$, $i \le j$. Thus for each of the $2^j$ nodes, the
probability of exactly one such path is $j/2^j$ and the probability of no such paths is
$1 - j/2^j$.

If $i > j$, then $v_{j+1} \cdots v_{i-1} \bar{v}_i v_{i+1} \cdots v_n = u_{j+1} \cdots u_n$ for some $i > j$. Precisely
$(n - j)2^j$ nodes satisfy this condition. All reasoning is the same as in the previous
case, except now $w$ must be chosen so that $w_1 \cdots w_j = u_1 \cdots u_j$. Thus the probability
that $u$ is the $j$th node along exactly one such path is $1/2^j$. The probability that no
path $\pi_i(v, w)$, $i > j$, crosses $u$ in this fashion is $1 - 1/2^j$.

Figure 2-7: A path (dashed lines) and its adjacent set of parallel paths (shaded).

We now need only consider the sum of $2^j$ 0-1 random variables each with probability
$j/2^j$ of equalling 1 and $(n - j)2^j$ 0-1 random variables each with probability $1/2^j$
of equalling 1. Call this sum $X$. Then the moment generating function $M(\lambda)$ for $X$
satisfies

$$M(\lambda) = \left(1 + \frac{(e^\lambda - 1)j}{2^j}\right)^{2^j} \left(1 + \frac{e^\lambda - 1}{2^j}\right)^{(n-j)2^j} \le e^{(e^\lambda - 1)j}\, e^{(e^\lambda - 1)(n - j)} = e^{n(e^\lambda - 1)}.$$

Thus $\Pr[X \ge bn] \le e^{n(e^\lambda - 1) - bn\lambda} = (e^{e^\lambda - 1 - b\lambda})^n$. Setting $\lambda = \ln b$, this implies
$\Pr[X \ge bn] \le (e^{b - 1 - b \ln b})^n$, an inverse polynomial whose exponent can be made as
small as desired by increasing the constant $b$. ■
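
The counting in Lemma 2.10 can be checked on a small cube. The sketch below is illustrative only: it assumes nodes are encoded as n-bit integers with coordinate i stored in bit i-1, and the helper names (`jth_node`, `count_crossings`, `simulate`) are hypothetical, not taken from the thesis.

```python
import random

def jth_node(v, w, i, j, n):
    """j-th node (counting v^i as node 0) on the path pi_i(v, w).

    The first j coordinates come from w^i, the remaining ones from v^i,
    where x^i denotes x with coordinate i flipped.
    """
    flip = 1 << (i - 1)
    vi, wi = v ^ flip, w ^ flip
    low = (1 << j) - 1                 # mask for coordinates 1..j
    high = ((1 << n) - 1) ^ low        # mask for coordinates j+1..n
    return (wi & low) | (vi & high)

def count_crossings(u, j, n, dest):
    """Number of paths pi_i(v, dest[v]) that have u as their j-th node."""
    return sum(
        jth_node(v, dest[v], i, j, n) == u
        for v in range(1 << n)
        for i in range(1, n + 1)
    )

def simulate(n, u=0, j=None, seed=0):
    """Every node v picks a uniform random destination; count paths through u."""
    rng = random.Random(seed)
    j = n // 2 if j is None else j
    dest = [rng.randrange(1 << n) for _ in range(1 << n)]
    return count_crossings(u, j, n, dest)
```

Summed over all choices of destination, exactly $nN$ triples $(v, w, i)$ place $u$ in position $j$, so the expected value of `simulate(n)` is $n$, in line with the $O(n)$ bound of the lemma.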

The $i$th piece created from $v$'s packet is sent to $v^i$, along the path $\pi_i(v, w)$ to $w^i$
and then to $w$. By lemma 2.10, at no time do more than $bn$ pieces cross a given
hypercube node, with high probability. Since the packets traversing any link all come
from one of the link's endpoints, no more than $2bn$ pieces cross the link during any
step of the routing. If all links and nodes have the capacity to hold $2b$ original packets,
then with high probability no buffering is necessary and no piece waits in a queue.

This analysis assumes that each node routes its packet to a random destination. 
If we use two phases as in the Valiant-Brebner scheme, the results extend to arbitrary 
permutation routings: 

Theorem 2.11. If all packets are divided into n pieces which are routed along parallel
paths in both phases of the routing algorithm, then for an arbitrary permutation, with
high probability the two-phase routing takes 2(n + 1) steps. No piece waits at any
time. ■

2.3.2 Fault-Tolerant Encoding of Pieces 

By giving the pieces more structure, we can make the information dispersal fault-tolerant.
We partition each packet $F$ into $n$ $O(L/n)$-bit pieces, but in such a way
that if any $m = n/2$ of the pieces arrive at the destination, the original packet may
be reconstructed.

Matrix multiplication is used to encode and decode the pieces. We need an $n \times m$
matrix $A$ every $m$ rows of which are linearly independent. We use the Hilbert matrix
$A_{ij} = 1/(x_i + y_j)$, where $x_i \ne x_{i'}$ $\forall i \ne i'$, $y_j \ne y_{j'}$ $\forall j \ne j'$ and $x_i + y_j \ne 0$ $\forall i, j$. For
all these distinctness conditions to hold we need a large field. We will use the field
$GF(2^s)$, with $s \approx \log \log N = \log n$.

Let $A'$ be the matrix formed by rows $i_1, i_2, \ldots, i_m$ of $A$. Then

$$\det A' = \frac{\prod_{k<l} (x_{i_k} - x_{i_l})(y_k - y_l)}{\prod_{k,l} (x_{i_k} + y_l)}.$$

Using this identity and Cramer's rule, we can invert any m rows of A in 0(m s )

steps. 

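To take advantage of the Hilbert matrix $A$ we block the bits of the message $F$
into a matrix over the field $GF(2^s)$. Write $F = b_1 \cdots b_l$, where $l = L/s$ and each $b_i$ is
an $s$-bit byte interpreted as an element of $GF(2^s)$. Group the $b_i$'s into $l/m$ columns
of $m$ bytes each, and call this matrix $B$. Each source $v$ computes $F_1, \ldots, F_n$ as the
rows of the matrix product $AB$:

$$\begin{pmatrix} a_{11} & \cdots & a_{1m} \\ a_{21} & \cdots & a_{2m} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix}
\begin{pmatrix} b_1 & b_{m+1} & \cdots & b_{l-m+1} \\ b_2 & b_{m+2} & \cdots & b_{l-m+2} \\ \vdots & \vdots & & \vdots \\ b_m & b_{2m} & \cdots & b_l \end{pmatrix}
= \begin{pmatrix} F_1 \\ F_2 \\ \vdots \\ F_n \end{pmatrix}$$

Given $m$ pieces (rows $F_i$), the destination $w$ can reconstruct $B$ (i.e. the packet $F$)
since the corresponding $m$ rows of $A$ are linearly independent. The destination just
inverts the matrix containing those $m$ rows.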

Note that it takes $O(m^2 \cdot l/m) = O(nl)$ word operations to multiply these matrices,
or $O(nls \log s)$ bit operations. The routing itself will take only $O(nls)$ bit operations,
so there exists a $\log s = \log \log \log N$ gap between the complexity of the encoding and
the routing stages of the protocol.
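
The encode and decode steps of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: it works over the prime field of integers mod 257 instead of $GF(2^s)$, picks $x_i = i + 1$ and $y_j = n + 1 + j$ for the matrix entries, and decodes by Gauss-Jordan elimination rather than the determinant identity with Cramer's rule. All function names are hypothetical.

```python
from itertools import combinations

P = 257  # assumption: a small prime field stands in for GF(2^s)

def cauchy_matrix(n, m, p=P):
    """n x m matrix A[i][j] = 1/(x_i + y_j) mod p with x_i = i+1, y_j = n+1+j."""
    assert 2 * n + m < p  # keeps every x_i + y_j nonzero mod p
    return [[pow((i + 1) + (n + 1 + j), -1, p) for j in range(m)]
            for i in range(n)]

def encode(message, n, m, p=P):
    """Split message (ints mod p, length divisible by m) into n pieces.

    Piece i is row i of the product A*B, where column c of B holds
    message[c*m : (c+1)*m].
    """
    cols = len(message) // m
    a = cauchy_matrix(n, m, p)
    b = [[message[c * m + k] for c in range(cols)] for k in range(m)]
    return [[sum(a[i][k] * b[k][c] for k in range(m)) % p for c in range(cols)]
            for i in range(n)]

def decode(pieces, n, m, p=P):
    """Recover the message from any m pieces, given as {row index: piece}."""
    rows = sorted(pieces)[:m]
    a = cauchy_matrix(n, m, p)
    aug = [a[i][:] + pieces[i][:] for i in rows]    # augmented system [A' | F']
    for col in range(m):                            # Gauss-Jordan elimination mod p
        piv = next(r for r in range(col, m) if aug[r][col])
        aug[col], aug[piv] = aug[piv], aug[col]
        inv = pow(aug[col][col], -1, p)
        aug[col] = [x * inv % p for x in aug[col]]
        for r in range(m):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(x - f * y) % p for x, y in zip(aug[r], aug[col])]
    b = [row[m:] for row in aug]                    # rows of B
    cols = len(b[0])
    return [b[k][c] for c in range(cols) for k in range(m)]
```

For example, with `n = 8` and `m = 4`, calling `decode` on any four of the eight pieces returned by `encode` gives back the original list. The modular inverse via `pow(x, -1, p)` requires Python 3.8 or later.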

2.3.3 Fault-Tolerance via Parallel Paths 

If we encode the original packet in the pieces via Rabin's matrix multiplication, then
we can bound the probability that $v$'s packet is lost by the probability that some $n/2$
of its pieces run into faulty components. But if that many pieces are lost, then at
least $n/4$ are lost during one of the two phases of the routing algorithm. Assume they
are lost in the first phase; the reasoning for phase 2 is identical. There are at most
$(2n + 3)n$ different components (nodes or links) encountered by pieces from $v$ during
the first phase. We need the following bound on the number of intersections between
the routes of different pieces.

Lemma 2.12. For any hypercube node $u \ne v, w$, no more than two paths in $\Pi(v, w)$
cross $u$.

Proof. Count the nodes along the path $\pi_i(v, w)$ starting with $v^i$ as the 0th node.
Say that the $k$th node along $\pi_i(v, w)$ is the same as the $l$th node along $\pi_j(v, w)$ for
$i < j$. Then $w^i_1 w^i_2 \cdots w^i_k v^i_{k+1} \cdots v^i_n = w^j_1 w^j_2 \cdots w^j_l v^j_{l+1} \cdots v^j_n$, where $v^{q'}_q = v_q$ iff $q \ne q'$
and similarly for $w^{q'}_q$.

There are four cases. If $k, l < j$ then $v^i_j = v^j_j$, a contradiction. Similarly, if $k, l \ge i$
then $w^i_i = w^j_i$, a contradiction. If $k < i$, $l \ge j$ or if $l < i$, $k \ge j$ then it must be true
that $w_i = \bar{v}_i$, $w_j = \bar{v}_j$ and $w_h = v_h$ for $i < h < j$. Thus all $\pi_h(v, w)$ with $i < h < j$
are precluded from crossing $u$ (otherwise $w_h = \bar{v}_h$, a contradiction). Therefore three
paths cannot all cross $u$. ■
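
Lemma 2.12 can be confirmed by brute force on small cubes. The sketch below assumes the same integer encoding of nodes as before (coordinate i in bit i-1); `parallel_paths` and `max_crossings` are hypothetical helper names.

```python
def parallel_paths(v, w, n):
    """Node lists of the n paths pi_i(v, w), i = 1..n, from v^i to w^i."""
    full = (1 << n) - 1
    paths = []
    for i in range(1, n + 1):
        flip = 1 << (i - 1)
        vi, wi = v ^ flip, w ^ flip
        nodes = []
        for j in range(n + 1):
            low = (1 << j) - 1          # coordinates 1..j already corrected
            nodes.append((wi & low) | (vi & (full ^ low)))
        paths.append(nodes)
    return paths

def max_crossings(n):
    """Largest number of paths in Pi(v, w) sharing a node u other than v, w."""
    worst = 0
    for v in range(1 << n):
        for w in range(1 << n):
            counts = {}
            for nodes in parallel_paths(v, w, n):
                for u in set(nodes):
                    if u != v and u != w:
                        counts[u] = counts.get(u, 0) + 1
            worst = max(worst, max(counts.values(), default=0))
    return worst
```

Running `max_crossings` over every pair $(v, w)$ is exponential in $n$, so it is only meant for cubes of a few dimensions.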

Since no component's failure will affect more than two pieces, it must be true that 
at least n/8 of the (2n + 3)n components have failed. Only in a small fraction of fault 
patterns will so many failures occur in such a small set of components. 

Theorem 2.13. For any constant $k > 0$ there is a sufficiently large constant $b > 0$
such that if each component of the hypercube fails independently with probability
$1/(bn)$ before or during some permutation routing, then with probability $1 - N^{-k}$ the
routing will be successfully completed. That is, a given packet will arrive at its
destination iff both its origin and destination do not fail.

Proof. Whether or not the $i$th component fails gives rise to a 0-1 random variable
whose moment generating function is $M_i(\lambda) = \frac{1}{bn} e^\lambda + (1 - \frac{1}{bn})$. The moment
generating function for the sum of these random variables is

$$M(\lambda) = \left(1 + \frac{e^\lambda - 1}{bn}\right)^{(2n+3)n} \le e^{(e^\lambda - 1)(2n+3)/b}.$$

Thus we can bound the probability that more than $n/8$ of the components fail by
$\exp\left((e^\lambda - 1)(2n+3)/b - \lambda n/8\right)$. Setting $\lambda = \ln(b/16)$, we see that the probability of so many
failures is no more than $(e^{1/4}(16/b)^{1/8})^n$. The exponent of this inverse polynomial in $N$
can be made as low as desired by increasing the constant $b$. ■
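
To see how the bound in this proof behaves numerically, the following sketch evaluates $\exp((e^\lambda - 1)(2n+3)/b - \lambda n/8)$ at $\lambda = \ln(b/16)$ and converts it to an exponent of $N = 2^n$. The function names are hypothetical, and the bound is the per-packet, per-phase quantity only; it assumes $b > 16$ so that $\lambda > 0$.

```python
import math

def loss_bound(n, b):
    """Bound on Pr[more than n/8 of (2n+3)n components fail],
    each failing independently with probability 1/(b*n)."""
    lam = math.log(b / 16)  # the choice lambda = ln(b/16) from the proof
    return math.exp((math.exp(lam) - 1) * (2 * n + 3) / b - lam * n / 8)

def exponent_in_N(n, b):
    """The k for which loss_bound(n, b) equals N**(-k), with N = 2**n."""
    return -math.log2(loss_bound(n, b)) / n
```

Increasing `b` increases `exponent_in_N`, which is the sense in which the exponent of the inverse polynomial can be tuned.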

2.4 Remarks

Offset routing and information dispersal are complementary techniques. By com- 
bining this simplified variant of information dispersal with offset routing, still better 
results are possible. The combined routing algorithm tolerates the failure of a con- 
stant fraction of the hypercube's components during the course of the routing of a 
single permutation. To send a packet, the node first disperses pieces to a well defined 
set of n nodes at distance three (instead of neighbors). The packets are then routed 
along parallel offset paths to the symmetric set of n nodes close to the destination. 
Finally, the pieces are combined at the destination. If each node or link fails inde- 
pendently of other components and if in the case it fails it does so at a random time 
during the routing then this combined algorithm tolerates the failure of a constant 
fraction of the hypercube's components. 

Chapter 3

Reconfiguration in the 
Presence of Faults 



3.1 Introduction 

In this chapter, we continue our investigation of the tolerance of the hypercube to 
randomly distributed faults. The techniques we develop assume a long-run view. 
Given that faults have accumulated in a hypercube over time, each component inde- 
pendently faulty with probability p, we would like to be able to program the machine 
while ignoring whatever faults exist. We show how to use the functioning parts of a 
hypercube with faults to simulate a hypercube without faults at a surprisingly low 
cost. More precisely, we show how to embed a hypercube in the functioning part of 
a hypercube with faults so that features such as locality are preserved. 

Before we can state our results formally and assess their value, we first must de- 
scribe the constraints, assumptions and objectives of network reconfiguration and/or 
simulation in the presence of faults. We divide the discussion into six general topic 
areas: preservation of locality, load balancing, message traffic, simulation overhead, 
algorithms for implementation, and modelling of faults. 

An embedding of a network $G_1$ into another network $G_2$ is a map $\phi : G_1 \mapsto G_2$
that maps each node of $G_1$ to a node of $G_2$ and each edge of $G_1$ to a path in $G_2$
between the images of its endpoints.

We call the pattern of faults $F$. That is, we include each component of $H_n$ in $F$
independently and with probability $p$. Therefore the functional part of the hypercube
is $H_n - F$. An embedding of $H_m$ into $H_n - F$ is a map $\phi : H_m \mapsto H_n - F$ that maps
nodes of $H_m$ to functioning nodes of $H_n$ and edges of $H_m$ to functioning paths of $H_n$.
The precise definition of functioning nodes and paths will vary, although the general
interpretation is straightforward.

Preservation of locality. Because communication in hypercube-based machines
is mostly local, and because communication is a dominant factor in measuring the
performance of a parallel machine, it is crucial that a good embedding of $H_m$ in
$H_n - F$ preserve locality. In other words, neighboring processors in $H_m$ should be
mapped to nearby processors in $H_n - F$. In order to quantify this notion, we say
that an embedding has dilation $D$ if for each edge $e$ in $H_m$, the path $\phi(e)$ has length
at most $D$ in $H_n - F$. Of course, it is most desirable to find embeddings with small
dilation. At the very least, the dilation of an embedding $\phi$ is a lower bound on the
time required for $H_n - F$ to simulate a single step of $H_m$ if the computation of each
node $v \in H_m$ is performed by $\phi(v)$ in $H_n - F$.

The notion of dilation can also be extended to paths. We will describe natural
embeddings of $H_{n-1}$ in $H_n - F$ for which nodes separated by distance $d$ in $H_{n-1}$ are
mapped to nodes separated by distance $d + 2$ in $H_n - F$. These embeddings have
dilation 3.

Balancing the load. We will consider embeddings which allow several nodes of
$H_m$ to be mapped to a single node of $H_n - F$. Mappings that are one-to-one are the
most desirable since then each processor of $H_n - F$ only has to simulate the action
of a single processor of $H_m$. In general, we define the max load of an embedding to
be the maximum number of processors of $H_m$ mapped to any single node of $H_n - F$.
One algorithm we describe discovers embeddings with max load 1 (i.e. one-to-one
mappings) while the other finds embeddings with constant max load.

In addition to having small max load, it is desirable to use as many of the functioning
cells of $H_n - F$ as possible. The use of live cells is partly described by the
max load. To further characterize this quantity, we define the expansion of an embedding
to be the ratio of the size of the largest one-to-one hypercube we could hope
to embed in $H_n - F$ to the size of the hypercube that we do embed. Since the size of
a hypercube is always a power of two, the expansion is

$$2^{\lfloor \log((1-p)2^n) - m \rfloor} = 2^{\lfloor n - m - \log\frac{1}{1-p} \rfloor}.$$

We focus on embeddings of $H_{n-1}$ in $H_n - F$ for $p < 1/2$. Such embeddings have
expansion one, which is the best possible.
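
As a quick numeric illustration of the expansion formula, the sketch below evaluates $2^{\lfloor \log((1-p)2^n) - m \rfloor}$ directly. The function name `expansion` is hypothetical, and logarithms are taken base 2 as elsewhere in the chapter.

```python
import math

def expansion(n, m, p):
    """Expansion of an embedding of H_m into H_n - F when each
    component is faulty with probability p (logs base 2)."""
    return 2 ** math.floor(math.log2((1 - p) * 2 ** n) - m)
```

With $m = n - 1$ and $p < 1/2$ the floor evaluates to 0, giving expansion one; embedding a cube one dimension smaller doubles the expansion.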

Message traffic. In addition to balancing the processing load among the functioning
processors, it is desirable to balance the message routing load among the wires.
In particular, it would not be good if many paths $\{\phi(e) \mid e \in H_m\}$ traversed a single
wire of $H_n - F$ since local communication along these paths would require the use
of the same wire. To formalize this notion we say that an embedding has congestion
$C$ if every edge of $H_n - F$ is contained in at most $C$ paths of $\{\phi(e) \mid e \in H_m\}$. We
consider embeddings with congestion as much as $O(\log N)$ and as little as $O(1)$.

Congestion is a lower bound on the time required for the functioning part of $H_n$
to simulate $H_m$ if messages traversing $e$ in $H_m$ are routed along $\phi(e)$ in $H_n - F$. For
some specific applications, however, we can do better. For example, hypercubes are
often used to simulate bounded-degree networks such as arrays and trees. In such
applications, only a constant number of wires incident to any node are used in any
parallel step of $H_m$. Hence, the effective congestion in the corresponding embedding
may be much less than it seems at first. To capture this notion, we say that an
embedding has induced congestion $I$ if every edge of $H_n - F$ is contained in at most $I$
paths of $\{\phi(e) \mid e \in H_m\}$ for which the edges $e \in H_m$ are node-disjoint. The two main
algorithms in this chapter find embeddings with constant induced congestion. Such
embeddings are particularly useful for simulating trees, arrays, normal hypercube
algorithms and other structures with bounded processor degrees.
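
The three basic measures defined so far (dilation, max load and congestion) can be computed mechanically from an explicit embedding. The sketch below assumes a hypothetical representation: `node_map` sends each guest node to a host node, and `edge_paths` sends each guest edge to the list of host nodes along its path. Induced congestion is not computed here.

```python
from collections import Counter

def embedding_metrics(node_map, edge_paths):
    """Return (dilation, max load, congestion) of an embedding.

    node_map:   {guest node: host node}
    edge_paths: {guest edge: [host nodes along the path, in order]}
    """
    dilation = max(len(path) - 1 for path in edge_paths.values())
    max_load = max(Counter(node_map.values()).values())
    use = Counter()
    for path in edge_paths.values():
        # count each host edge at most once per path, ignoring direction
        for edge in {frozenset(e) for e in zip(path, path[1:])}:
            use[edge] += 1
    congestion = max(use.values())
    return dilation, max_load, congestion
```

For instance, mapping the guest path 0-1-2 to host nodes 0, 3, 1 with paths `[0, 1, 3]` and `[3, 1]` gives dilation 2, max load 1 and congestion 2, since both paths use the host edge between 1 and 3.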

Simulation overhead. One obvious use of a hypercube with faults is to simulate
a hypercube without faults. This can always be done given enough slowdown and
duplication of resources, but the goal is to make the simulation as efficient as possible.
The key factors influencing the efficiency of the simulation are dilation, max load and
congestion. By achieving good bounds on these values, we show that any step of
a hypercube $H_n$ can be simulated by $O(1)$ steps of $H_n - F$. In addition, we use
the notion of induced congestion to show that a hypercube with faults can simulate
trees, arrays and other bounded-degree networks of the same size with only constant
slowdown.

Algorithms for implementation. In addition to proving that there is an efficient
embedding of $H_m$ in $H_n - F$, it is desirable to develop an efficient algorithm
for finding the embedding. Ideally, the algorithm would be deterministic, fast, easy
to implement, and decentralized (i.e., using only local control). In fact, we describe
such algorithms in sections 3.2 and 3.3. We also describe a fast, local probabilistic
algorithm in section 3.5.

Modelling of faults. In general, we might consider three types of faults in 
H n . The most serious fault would be one that completely destroys a node and all 
wires incident to it. We call such faults total. A less serious fault would be one that 
destroys just the computational portion of a node, and leaves the communication (i.e. 
switching or routing) portion of the node intact as well as the incident wires. We call 
such faults partial. (Note that it does not make sense to consider a fault that destroys 
just the communication portion of the node. The computation portion would then 
also be useless since it would be disconnected from the rest of the network.) Last, 
faults could occur in individual wires. 

In our model, no malicious faults occur. Any node can determine if a neighboring 
node or link has failed by probing the link in $O(1)$ time.

Along with the type of fault, the distribution of faults must also be specified. As 
with routing in chapter two, we consider a model in which faults occur independently 
among components with probability $p$. We restrict our attention to the situation
where $p < 1/2$, although the methods can be extended for larger $p$. In addition, we
consider the case in which the number of faults is smaller than a constant fraction 
of the total number of nodes. The assumption concerning independence of faults is 
crucial to our analysis, but the methods can also be applied in a hierarchical setting 
where entire subcubes of nodes may fail at once. Such extensions might be useful in 
a practical setting where the actual machine may consist of a collection of boards,
each of which consists of a collection of chips, and so on. 

The material we present is philosophically related to previous work in fault tolerance
of arrays in the context of wafer-scale integration ([Gr],[GG],[LL1],[LL2]), although
the techniques and results are quite different. For example, constant dilation
reconfiguration is not possible for arrays and trees. There has been relatively little 
previous work on the fault-tolerant reconfiguration of hypercubes (to our knowledge). 
An exception is the work of Becker and Simon ([BS]), who consider fault-free sub- 
cubes of a hypercube containing worst case faults. The constraint that the embedded 
cube be a subcube (i.e., dilation 1) is very restrictive, as is the assumption that faults 
are located in a worst case fashion. Hence, the techniques and results of [BS] are quite 
different from those presented here. Another exception is the work of Dolev, Halpern, 
Simons and Strong ([DHSS]) who also study worst case bounds. Their model of com- 
munication also differs from ours in that they assume that after the faults occur, the 
new connections must be chosen from a predetermined set of routings. 

3.1.1 Summary of Results 

At first, we consider an N-node hypercube containing random partial processor faults.
We describe algorithm 3.1, an algorithm for embedding an N/2-node hypercube in
the functioning processors.

Theorem 3.3. Algorithm 3.1 is a local, deterministic $O(\log N)$ step algorithm. If
the nodes of $H_n$ fail independently and partially with probability $p \le 1/4$ then with
probability at least 15/16 algorithm 3.1 constructs a one-to-one embedding of $H_{n-1}$
into $H_n - F$ with dilation 3, congestion $2 \log N$, and induced congestion 2.

Next we improve this algorithm so that it embeds an N/2-node hypercube in
the functioning processors with the same performance with probability $1 - N^{-c}$ for
some constant $c > 0$, provided that processors are faulty with probability $p < 1/2$, for
sufficiently large $N$. The algorithm for finding the embedding is deterministic, easy to
implement, runs in $O(\log N)$ parallel steps, and uses only local control. As a result, we
extend the results of Bhatt, Chung, Leighton and Rosenberg [BCLR] and others to be fault-tolerant.
In particular, we show that a hypercube with partial processor faults can simulate 
any binary tree or mesh of the same size with only constant factor slowdown. The 
most surprising (and potentially most useful) feature of the embedding is the degree 
to which it preserves locality. 

Next, we extend the results to handle total faults. We describe an embedding for
which the dilation is 7, the max load is 1, the congestion is $O(\log N)$, and the induced
congestion is $O(\log N / \log \log N)$ with high probability (for sufficiently large $N$). The
algorithm for achieving these results is probabilistic, runs in $O(\log N)$ steps, and uses
only local control.

Finally, we address the issue of congestion. We demonstrate a probabilistic algorithm
which with high probability finds an embedding in a hypercube containing
totally faulty processors for which the dilation is 5, the max load is $O(1)$, and the
congestion is $O(1)$.

Theorem 3.22. For each $p < 1 - \sqrt[4]{1/2}$ (about .16) there is an $O(\log N)$ step algorithm
such that if each of the nodes of an N-node hypercube fails with probability $p$ then
with probability $1 - N^{-c}$, for some constant $c > 0$, the algorithm finds an embedded
fully functioning N-node cube with constant load, dilation and congestion. The paths
which simulate the edges of the cube only use live nodes.

As a consequence, a faulty hypercube can simulate a functioning hypercube of the 
same size with constant delay. 

These last two algorithms actually work in a semi-worst case setting. As long as 
a constant fraction of each node's neighbors remain alive and a constant fraction of a 
specified set of paths for each node have no faults along them, the good embeddings 
exist. 

Chapter three is the result of joint work with Johan Hastad and Tom Leighton. 

3.1.2 Overview

In section 3.2, we consider only partial faults which occur with probability p < 1/4. 
We extend the algorithm to handle failure probabilities up to 1/2 in section 3.3. A 
probabilistic algorithm for reconfiguring with total faults appears in section 3.4 and 
section 3.5 contains a probabilistic algorithm achieving constant delay reconfiguration 
with total faults. In section 3.6 we describe a way to implement the algorithms of
section 3.5 so that they run in $O(\log^2 N \log \log N)$ time. In section 3.7 we extend our
results to the cases where the probability of failure is very low and also to the case 
where edge faults occur. 

3.2 Embeddings for Small p with Dilation 3 

In this section we consider the less severe model of partial faults where it is possible to
use the faulty processors as switches and to route through them. We assume that the
probability that any given processor fails is less than or equal to 1/4 and we present an
algorithm which with probability 15/16 constructs a one-to-one embedding of $H_{n-1}$
in $H_n$ with dilation 3, congestion $2n$ ($= 2 \log N$) and induced congestion 2.

3.2.1 Mapping Dead Nodes to Live Nodes 

Let $H_{n-1}$ be the subhypercube on $N/2 = 2^{n-1}$ nodes induced by the nodes with first
coordinate zero. For each node $v$ in $H_n$, let $v'$ be the node with first coordinate
different from $v$'s whose coordinates otherwise agree with those of $v$ (i.e. $v$'s neighbor
across the first dimension). Also, for a node $y = (y_1, y_2, \ldots, y_{n-1})$ in $H_{n-1}$, let $\bar{y}$ be
the node in $H_n$ with coordinates $(0, y_1, y_2, \ldots, y_{n-1})$.

Given some pattern of failure for the nodes in $H_n$, say a node $v \in H_{n-1}$ is rich
if both $v$ and $v'$ are live and poor if both $v$ and $v'$ are dead. If every node in $H_{n-1}$
were not poor, we could easily embed $H_{n-1}$ in $H_n - F$ by mapping $(y_1, \ldots, y_{n-1})$ to
whichever of $\{(0, y_1, \ldots, y_{n-1}), (1, y_1, \ldots, y_{n-1})\}$ were alive. Unfortunately, there will
be a constant fraction of poor nodes in $H_{n-1}$ with very high probability since each
node in $H_{n-1}$ is poor with probability $p^2$ (see footnote 1).

Figure 3-1: Borrowing from a neighbor. Dead nodes are shown as white. The arrow
points from the simulated node to the simulating node.

We handle the existence of poor nodes by mapping each poor node $v$ to a neighboring
rich node $w$. At most one $v$ is mapped to a given node $w$. Hence $v$ can borrow
$w'$. In this fashion we will be able to embed $H_{n-1}$ in $H_n - F$ much as if there were
no poor nodes at all. At step $k$ algorithm 3.1, shown in figure 3-2, will attempt to
assign $v$ to $v^{k+1}$, if $v$ and $v^{k+1}$ are as yet unassigned.

    for k ← 2 to n
        for all nodes v
            if v is poor and unassigned and v^k is rich and unassigned
                assign v to v^k


Figure 3-2: Algorithm 3.1. 
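
The borrowing procedure of figure 3-2 can be simulated directly. The following is a sketch under stated assumptions rather than the thesis's own code: nodes of $H_n$ are n-bit integers, the "first coordinate" is taken to be bit 0 (so $H_{n-1}$ is the set of even integers and $v' = v \oplus 1$), and `dead` is the set of faulty nodes. The names `algorithm_3_1` and `random_faults` are hypothetical.

```python
import random

def algorithm_3_1(n, dead):
    """Return {v: image of v} for v in H_{n-1}, or None if some poor
    node is still unassigned after the last step."""
    alive = lambda x: x not in dead
    cube = range(0, 1 << n, 2)                    # first coordinate (bit 0) is zero
    rich = {v for v in cube if alive(v) and alive(v ^ 1)}
    poor = {v for v in cube if not alive(v) and not alive(v ^ 1)}
    lender = {}                                   # poor node -> rich neighbor it borrows from
    taken = set()                                 # rich nodes that are already lending
    for k in range(2, n + 1):                     # dimension k is stored in bit k-1
        for v in sorted(poor):
            w = v ^ (1 << (k - 1))                # v^k, the neighbor of v across dimension k
            if v not in lender and w in rich and w not in taken:
                lender[v] = w
                taken.add(w)
    if len(lender) < len(poor):
        return None
    image = {}
    for v in cube:
        if alive(v):
            image[v] = v
        elif alive(v ^ 1):
            image[v] = v ^ 1
        else:
            image[v] = lender[v] ^ 1              # poor v borrows w'
    return image

def random_faults(n, p, seed=0):
    """Each node of H_n fails independently with probability p."""
    rng = random.Random(seed)
    return {x for x in range(1 << n) if rng.random() < p}
```

Within one step each node has a single partner across dimension $k$, so processing the nodes sequentially gives the same result as the parallel "for all nodes" in the figure. Only the node mapping is modelled; routing through partially faulty nodes is not.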



3.2.2 Analysis of the Borrowing 

If processors fail with probability less than or equal to 1/4, algorithm 3.1 will construct 
an embedding with probability at least 15/16. We show this by proving two simple 



1 We use the phrase Q is more than Q(g) with very high probability to mean "There exist constants 
k and d independent of N such that the probability that Q does not exceed dg is less than 2~ ." 
lemmas. First we will prove a lemma which will be of crucial importance to the later
analysis. 

Lemma 3.1. At step k — 1, what has happened to v is independent of what has 
happened to any node which differs from v in at least one of the last n — k coordinates. 

Proof. At step i, nodes that affect each other have all coordinates identical except
the ith coordinate. Thus at step k - 1, if we divide the nodes into groups each having
identical last n - k coordinates, all previous communication has taken place within
the groups. Thus nodes in different groups cannot affect each other. ■

Lemma 3.2. The probability that a given node $v$ is poor and unassigned after the
$i$th step is at most $(2p)^i p^2$.

Proof. For each node $v$ let $p_i = \Pr[v \text{ is poor and unassigned after step } i]$ and $q_i =
\Pr[v \text{ is rich and unassigned after step } i]$. Then $p_0 = p^2$ and $q_0 = (1 - p)^2$. A node $v$
will be poor and unassigned after step $i + 1$ if and only if it was poor and unassigned
after step $i$ and the node it requested in step $i + 1$ was not rich and unassigned. A
similar statement holds for whether a node is rich and unassigned after step $i + 1$.
Thus, since these probabilities are independent,

$$q_{i+1} = q_i(1 - p_i)$$
$$p_{i+1} = p_i(1 - q_i)$$

Subtracting the two equations yields $q_{i+1} - p_{i+1} = q_i - p_i$, which is natural since the
surplus of rich nodes over poor nodes is constant. Thus the difference is $q_i - p_i =
q_0 - p_0 = 1 - 2p$, or $q_i = 1 - 2p + p_i$. Therefore $p_{i+1} = p_i(2p - p_i) \le (2p)p_i$ and so
$p_i \le (2p)^i p_0 = (2p)^i p^2$. ■

The probability that an individual node is poor and unassigned at the end of the
algorithm is less than $(1/2)^{n-1}(1/16) = 1/(8N)$. Thus the probability that some node
is poor and unassigned is no more than $(N/2)(1/(8N)) = 1/16$.
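
The recurrence from Lemma 3.2 and the resulting union bound can be checked with exact rational arithmetic. The sketch below iterates $p_{i+1} = p_i(1 - q_i)$ and $q_{i+1} = q_i(1 - p_i)$; the function name is hypothetical.

```python
from fractions import Fraction

def unassigned_probabilities(p, steps):
    """Return [(p_0, q_0), ..., (p_steps, q_steps)] for the recurrence
    p_{i+1} = p_i (1 - q_i), q_{i+1} = q_i (1 - p_i),
    starting from p_0 = p^2 and q_0 = (1 - p)^2."""
    p = Fraction(p)
    pi, qi = p * p, (1 - p) ** 2
    out = [(pi, qi)]
    for _ in range(steps):
        pi, qi = pi * (1 - qi), qi * (1 - pi)
        out.append((pi, qi))
    return out
```

For $p = 1/4$ and $n - 1$ steps, the last $p_i$ multiplied by $N/2$ stays below 1/16, matching the calculation above; the invariant $q_i - p_i = 1 - 2p$ holds exactly at every step.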
Figure 3-3: Mapping an edge to a path. The heavy edges form the path chosen to
simulate the edge between the two simulated nodes at bottom. 

3.2.3 Embedding Edges 

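If the algorithm successfully assigns each poor node to a rich node, call the assignment
$\psi$. Embed $H_{n-1}$ in $H_n$ with the embedding $\phi$ which maps nodes in $H_{n-1}$ to nodes in
$H_n$ by

$$\phi : y \mapsto \begin{cases} \bar{y} & \text{if } \bar{y} \in H_n - F \\ (\bar{y})' & \text{if } (\bar{y})' \in H_n - F \text{ but } \bar{y} \in F \\ \psi(\bar{y})' & \text{otherwise} \end{cases}$$

and maps edges in $H_{n-1}$ to paths in $H_n$ by

$$\phi : (y, z) \mapsto \begin{cases} (\phi(y), \phi(z)) & \text{if } \phi(y) = \bar{y},\ \phi(z) = \bar{z} \\ (\phi(y), \phi(y)', \bar{y}, \phi(z)) & \text{if } \phi(y) \ne \bar{y},\ \phi(z) = \bar{z} \\ (\phi(y), (\bar{y})', (\bar{z})', \phi(z)) & \text{if } \phi(y) \ne \bar{y},\ \phi(z) \ne \bar{z} \end{cases}$$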

Although the mapping of the edges looks complicated, every edge simply maps to a 
shortest path between the corresponding nodes. Figure 3-3 depicts an instance of the 
third possibility. Since in all cases the length of the path is at most 3 the embedding 
has dilation 3. 

To check the congestion, observe that a given edge is used on a given path $\phi((y, z))$
only when one of its endpoints is $\phi(y)$, $\bar{y}$, or $(\bar{z})'$. Checking cases shows that no edge
could lie on three paths corresponding to node-disjoint edges. Thus, since we can
partition the edges of the hypercube into $n$ matchings, the congestion can be no
worse than $2n$. By this argument it also follows that the induced congestion of the
embedding is at most 2.

Theorem 3.3. Algorithm 3.1 is a local, deterministic $O(\log N)$ step algorithm. If
the nodes of $H_n$ fail independently and partially with probability $p < 1/4$ then with
probability at least 15/16 algorithm 3.1 constructs a one-to-one embedding of $H_{n-1}$
into $H_n - F$ with dilation 3, congestion $2\log N$, and induced congestion 2.

3.3 Embeddings with Dilation 3 for p < 1/2 

In this section we extend the algorithm of section 2 so that it can handle independent 
faults with probabilities exceeding 1/4 but less than 1/2. This is best possible in the 
sense that if p > 1/2, then more than half of the nodes will fail with probability at 
least 1/2. In that case it would be impossible to achieve a one-to-one embedding of 
$H_{n-1}$ in $H_n - F$.

Call a node $v \in H_{n-1}$ a topnode if $v$ is dead but $v'$ is alive. We now handle the
existence of poor nodes by mapping each poor node $v$ to a neighboring node $w$ which
is either rich or a topnode. If $w$ is a topnode, we make sure that $w$ has a rich neighbor
$u$ so that $w$ can borrow $u'$. We call this process pushing a topnode.
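
The node categories used here can be made concrete with a small sketch. It rests on assumptions that are not spelled out in this section: nodes are encoded as integers, $v'$ is taken to be the neighbor of $v$ across the highest dimension, "rich" is read as both $v$ and $v'$ alive and "poor" as both dead (which matches $q_0 = (1-p)^2$ and $p_0 = p^2$ in the proof of lemma 3.2). The function names and the label for the fourth case are hypothetical.

```python
def partner(v, n):
    """Return v', assumed here to be the neighbor of v across the highest dimension."""
    return v ^ (1 << (n - 1))


def classify(v, n, dead):
    """Classify node v of the (n-1)-cube, given the set of dead nodes of the n-cube."""
    v_dead, partner_dead = v in dead, partner(v, n) in dead
    if not v_dead and not partner_dead:
        return "rich"                    # assumed: v and v' both alive
    if v_dead and partner_dead:
        return "poor"                    # assumed: v and v' both dead
    if v_dead:
        return "topnode"                 # v dead but v' alive, as defined above
    return "live with dead partner"      # hypothetical label for the remaining case


n = 3
dead = {0b000, 0b100, 0b001, 0b110}      # illustrative fault set
labels = {v: classify(v, n, dead) for v in range(1 << (n - 1))}
```

With this fault set, node 0 is poor, node 1 is a topnode, and node 3 is rich, so node 1 is the kind of node that algorithm 3.2 may push.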

Algorithm 3.2, shown in figure 3-5, will carry out the program outlined above in 
4 stages. The only additional feature is that poor nodes without enough topnode 
neighbors will be treated separately. 

Observe that conflicts can only occur during stage 4, and can be easily resolved 
by having the node with lower index win. 

Lemma 3.4. Assume that nodes fail independently with probability $p < 1/2$ and
that $N$ is sufficiently large. Then there is a constant $c_5 > 0$ such that after algorithm
3.2 terminates, with probability $1 - N^{-c_5}$: (i) every poor node is assigned to a neighbor
which is either a topnode or a rich node, and (ii) every topnode which has been
assigned to a poor node is pushed to a rich neighbor.

Figure 3-4: Pushing a topnode. Dead nodes are shown as white. The arrows point
from the simulated nodes to the simulating nodes.

3.3.1 Analyzing Stages 1 and 2 

Stages 1 and 2 comprise a "first pass" to assign poor nodes. In stage 1, those poor 
nodes with few topnode neighbors are given first crack at assignments, since these
nodes will have much less ability to push neighboring topnodes later in stage 4. 
Stage 2 replicates algorithm 3.1. We expect that the vast majority of nodes will find 
assignments during this stage. 

Let $\epsilon$ be a small positive constant depending on $p$ and let $d = d(\epsilon, p)$ and
$c = c(\epsilon, p)$ be suitable positive constants depending only on $\epsilon$ and $p$.
Throughout the argument we will assume that $N$ is sufficiently large. The neighborhood
of a point is the set of points at distance 1 from the point, and a sphere denotes a
sphere in the Hamming metric.

Lemma 3.5. For $\epsilon < p(1-p)/(4\sqrt{2})$ there is a positive integer constant
$d = d(\epsilon, p)$ such that for sufficiently large $N$, with probability $1 - 1/N$ no sphere
of radius 6 contains more than $d$ nodes processed in stage 1.

Stage 1:
    for every poor node $v$ which has fewer than $\epsilon n$ topnodes as neighbors
    across dimensions $> \epsilon n$ do
        for $k \leftarrow 2$ to $n$
            if $v^k$ is rich and unmarked, mark it $v$
Stage 2:
    for $k \leftarrow 2$ to $\epsilon n$ for all nodes $v$
        if $v$ is poor and unassigned and $v^k$ is rich and unassigned assign $v$ to $v^k$
Stage 3:
    for every node $v$ which was processed in stage 1
        assign $v$ to the node which was marked $v$
    for every node $w$ assigned to a marked node during stage 2
        $w$ becomes unassigned
Stage 4:
    for all unassigned poor nodes $v$ do
        for $k \leftarrow \epsilon n + 1$ to $n$
            if $v^k$ is an unpushed topnode and there is an unassigned rich node
            $v^{kj}$ for some $j > \epsilon \log N$
                assign $v$ to $v^k$ and push $v^k$ to $v^{kj}$

Figure 3-5: Algorithm 3.2.

Proof. Take any sphere of radius 6 and any $d$ points in this sphere. The union
of their neighborhoods is of size at least $dn - d^2 > dn/2$. ($N = 2^n$ is assumed to
be sufficiently large and any two neighborhoods do not have more than 2 points in
common). By assumption, at most $2d\epsilon n$ nodes in this neighbor set can be topnodes.
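Since the probability that an individual node is a topnode is $p(1-p)$ the probability
of having exactly $i$ topnodes in a set of size $dn/2$ is

\[ p_i = \binom{dn/2}{i} (p(1-p))^i (1 - p(1-p))^{dn/2 - i} \]

Observe that $p_{i+1}/p_i > \sqrt{2}$ for $i < p(1-p)dn/(2\sqrt{2})$. Using this and the fact
that any $p_i$ is less than 1 we get

\[ \sum_{i=0}^{2d\epsilon n} p_i \le p_{2d\epsilon n} \sum_{i=0}^{2d\epsilon n} 2^{-i/2}
   < 4 p_{2d\epsilon n}
   \le 4 \cdot 2^{-\left(p(1-p)dn/(2\sqrt{2}) - 2d\epsilon n\right)/2}
   < 4\exp(-c_6 dn) \]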

The probability that fewer than $2d\epsilon n$ nodes in a set are topnodes decreases as we
make the set larger than $dn/2$. Thus this bound holds no matter what the actual size
of the union of the neighborhoods may be. Since there are at most $N$ possible spheres
and at most $\binom{n^6}{d}$ ways of choosing $d$ points, the lemma follows for large enough $d$. ■

Next we show that stage 1 has a good probability of success for the nodes to which 
it is applied. 

Lemma 3.6. The probability that there exists a node which has fewer than $2\epsilon n$
neighbors across dimensions greater than $\epsilon n$ which are either rich or topnodes is
bounded by $N^{-c_7}$ for sufficiently small $\epsilon > 0$ and $c_7 > 0$.

Proof. This proof uses reasoning similar to that of the proof of lemma 2.3. Choose
$\epsilon' > 0$ such that $(1 - \epsilon') \log p = -1 - c_8$ for $c_8 > 0$. The probability of
having fewer than $\delta \log N$ neighbors across dimensions greater than $\epsilon' \log N$
which are either rich or topnodes is

\[ \sum_{i=0}^{\delta \log N} \binom{(1-\epsilon')\log N}{i} (1-p)^i p^{(1-\epsilon')\log N - i} \]

We can compare consecutive terms to show that this sum is bounded by a constant
times the last term. The last term is

\[ \binom{(1-\epsilon')\log N}{\delta \log N} (1-p)^{\delta \log N} p^{(1-\epsilon'-\delta)\log N} \]
\[ \le \left( \frac{e(1-\epsilon')\log N}{\delta \log N} \right)^{\delta \log N} (1-p)^{\delta \log N} p^{(1-\epsilon'-\delta)\log N} \]
\[ = \exp_2\bigl( \delta \log(e(1-\epsilon')) \log N - \delta \log\delta \log N
     + \delta \log N \log(1-p) - \delta \log N \log p + (1-\epsilon') \log N \log p \bigr) \]
\[ = \exp_2\bigl( h(\delta,p) \log N + (1-\epsilon') \log N \log p \bigr) \]

For $\delta = 0$, $h(\delta,p) = 0$ and the above expression is $N^{-1-c_8}$. Since $h(\delta,p)$ is
continuous, there is a $\delta > 0$ such that the above expression is bounded by $N^{-1-c_7}$
with $c_7 > 0$. Finally set $\epsilon = \min\{\epsilon', \delta/2\}$ and observe that decreasing
$\epsilon'$ to $\epsilon$ can only decrease the probability of having at most $\delta \log N$ topnode
neighbors across dimensions $> \epsilon' \log N$. ■

By lemma 3.6, with probability $1 - N^{-c_7}$ any node processed in stage 1 must have
at least $\epsilon n$ rich neighbors. The only way it could fail to mark one of these is if they
were all marked by other nodes. This is impossible since by lemma 3.5 there is only
a constant number of nodes processed during stage 1 within distance 6. Thus each
node participating in stage 1 successfully marks a rich node.

Let us next analyze stage 2. Note that stage 2 is independent of stage 1. By lemma
3.2, after stage 2 the probability that an individual node is poor and unassigned is
$< N^{-c_9}$ while the probability that it is rich and unassigned is $> (1 - 2p)$.

Let $a \in \{0,1\}^{\epsilon n}$ and let $H_a$ be the $(1-\epsilon)n$ dimensional hypercube which has
the $i$th coordinate $a_i$, $i = 1, \ldots, \epsilon n$. Observe that by lemma 3.1 the statuses of
nodes in an individual hypercube are independent during stage 2.

Lemma 3.7. There is a constant $d = d(p)$ such that the probability that there is a
sphere of radius 4 in any $H_a$ which contains more than $d$ unassigned poor nodes after
stage 2 is less than $1/N$.

Proof. There are $N$ ways to choose a sphere over all $H_a$'s and at most $\binom{n^4}{d}$ ways of
choosing $d$ points in each sphere. The probability that these $d$ nodes are poor and
unassigned is $N^{-dc_9}$ and the lemma follows for sufficiently large $d$. ■

3.3.2 Analyzing Stages 3 and 4 

Stages 3 and 4 are responsible for assigning those nodes which remain unassigned after 
stage 2. In stage 3, nodes assigned in stage 1 negate some of stage 2's assignments. 
Bumped nodes find new assignments in stage 4. 

Lemma 3.8. With high probability, after stage 3 there are only $2d$ unassigned poor
nodes in any sphere of radius 4 in any $H_a$.

Proof. This follows from lemmas 3.5 and 3.7. The only additional unassigned poor
nodes come from the nodes whose assignments are stolen in stage 3. But since the
thief is at distance two, lemma 3.5 bounds the number of poor nodes subject to theft
in any sphere of radius 4. ■

To prove lemma 3.4 observe finally that stage 4 only works inside an individual
$H_a$. Fix a poor unassigned node at the beginning of stage 4. It has $\epsilon n$ topnode
neighbors. The probability that each individual topnode neighbor does not have $\epsilon n$
rich unassigned neighbors is $N^{-c_{10}}$. Thus with high probability each unassigned
poor node has $\frac{\epsilon}{2} n$ topnode neighbors with $\epsilon n$ unassigned rich neighbors each.
Stage 3 reduces the number of rich neighbors to each topnode only by a constant. By lemma
3.8 we know that during stage 4 only $2d$ other unassigned poor nodes can interfere.
Therefore stage 4 is successful and lemma 3.4 follows. ■

If the algorithm successfully assigns each poor node to a rich node or topnode and
each pushed topnode to a rich node, call the assignment $\psi$. Embed $H_{n-1}$ in $H_n - F$
with the embedding $\phi$ which maps nodes in $H_{n-1}$ to nodes in $H_n - F$ by

\[ \phi : y \mapsto \begin{cases}
y & \text{if } y \in H_n - F \\
(y)' & \text{if } (y)' \in H_n - F,\ y \in F \text{ and } y \text{ is not pushed} \\
\psi(y)' & \text{otherwise}
\end{cases} \]

and maps edges in $H_{n-1}$ to paths in $H_n - F$ precisely as discussed in section 2.

Theorem 3.9. Algorithm 3.2 is a local, deterministic algorithm. For any $p < 1/2$ there
is a sufficiently small constant $c_{11} > 0$ such that if the nodes of $H_n$ fail independently
and partially with probability $p$, for sufficiently large $N$ the following is true with
probability $1 - N^{-c_{11}}$. Algorithm 3.2 takes $O(\log N)$ steps and constructs a one-to-one
embedding of $H_{n-1}$ into $H_n - F$ with dilation 3 and congestion $2\log N$. The
embedding has the property that if a constant degree network C is embedded in $H_{n-1}$
then the induced embedding in $H_n - F$ has constant congestion.

The only part of the theorem which we have not yet checked is the number of steps 
stage 4 takes. Figure 3-6 gives a more detailed description of the implementation of 
stage 4. First, each unassigned poor node is tentatively assigned to a constant number 
of topnode neighbors. Each topnode chosen attempts to tentatively assign itself to 
one of its unassigned rich neighbors. Each poor node then finds a topnode to which 
it is tentatively assigned which successfully was assigned to a rich node. 

Since by lemma 3.8 there are few unassigned poor nodes in any small sphere and
we know that most topnodes will have many rich neighbors, the above procedure will
assign every unassigned poor node to a topnode with high probability.

for all poor nodes $v$ unassigned after stage 3 do
    for $k \leftarrow \epsilon n + 1$ to $n$
        if $v^k$ is an unassigned topnode
            assign $v$ to $v^k$ unless $v$ is already assigned to $8d$ neighbors
for all assigned topnodes $u$ do
    for $k \leftarrow \epsilon n + 1$ to $n$
        if $u^k$ is an unassigned rich node assign $u$ to $u^k$ and stop
for all poor nodes $v$ unassigned after stage 3 do
    for $k \leftarrow \epsilon n + 1$ to $n$
        if $v^k$ was assigned to $v$ and succeeded in being assigned to a rich node $w$
            push $v^k$ to $w$ and assign $v$ to $v^k$

Figure 3-6: Stage 4.

3.4 Routing Using Only Live Nodes 

If we consider total faults instead of partial faults, algorithm 3.2 fails in several places. 
In fact, any path in the embedding which does not consist of only a single edge has at 
least one dead node internal to it. In order to handle total faults we will replace the 
paths of length 3 in $H_n - F$ which constitute the edges of the embedded hypercube
with paths of length 7 which use only live nodes. 

In the remainder of the chapter we will use probabilistic algorithms. To guarantee 
the performance of these algorithms, we will need to know that certain assumptions 
about the distribution of faults hold true. These assumptions are stated in several 
lemmas (for example, lemmas 3.10 and 3.11). Given that these distribution assump- 
tions hold (which they do except with inverse polynomial probability), the algorithms 
work with high probability. The errors arising during particular executions of the al- 
gorithms are thus in some sense independent of the existence of unusual fault patterns. 

First we establish that all nodes have a reasonably large number of live neighbors. 

Lemma 3.10. There exists a $c_{12} > 0$ such that for any node $v$, the set $N_v$ of live
neighbors of $v$ has cardinality at least $\epsilon n$ with probability $1 - N^{-c_{12}}$.

Proof. A calculation almost identical to the one in the proof of lemma 3.6. ■

With probability close to 1, all pairs of nodes are connected by many paths each
of which contains only live nodes.

Figure 3-7: A live path $P^{ij}$. The darkened path simulates an edge $(u,v)$.

Lemma 3.11. Suppose every node fails with probability $p < 1/2$. Then with
probability $1 - N^{-c_{12}}$ there are $\Omega(n^2)$ live paths of length at most 7 between any
points $u$ and $v$ within Hamming distance 3, where we choose $c_{12}$ as in lemma 3.10.

Proof. We will prove only the case where the distance is 3; the other cases are
similar. Let $P = (u, w_1, w_2, v)$ be a path of length 3 between $u$ and $v$. The paths we
will consider are of the type $P^{ij} = (u, u^i, u^{ij}, w_1^{ij}, w_2^{ij}, v^{ij}, v^j, v)$. Let $N_u$ be
the set of dimensions $k$ for which $u^k$ is live and similarly for $N_v$. Take the larger of
the two sets $\{P^{ij} \mid i \in N_u, j \in N_v, i < j\}$ and $\{P^{ij} \mid i \in N_u, j \in N_v, i > j\}$.
By lemma 3.10 this set has cardinality $\epsilon^2 n^2/2$. The interior 4 nodes of these paths
are disjoint for different pairs $i, j$ (if we discard $i, j$ where either $i$ or $j$ is a dimension
used along $P$) and the outer 4 nodes are all alive. Thus with high probability
$\Omega(n^2)$ of these paths use only live nodes. ■
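
The shape of the detour paths in this proof can be checked mechanically. Below is a minimal sketch under stated assumptions: nodes are encoded as integers, superscripts denote bit flips, and the helper names `flip`, `detour_path` and `is_path` are hypothetical. It builds the eight-node path for every pair of dimensions not used along $P$ and verifies that each is a genuine hypercube path and that the interior four nodes are disjoint across pairs.

```python
def flip(x, *dims):
    """Flip the listed bit positions (dimensions) of node x."""
    for d in dims:
        x ^= 1 << d
    return x


def detour_path(base_path, i, j):
    """Build the 8-node path that detours P = (u, w1, w2, v) through dimensions i and j."""
    u, w1, w2, v = base_path
    return [u, flip(u, i), flip(u, i, j), flip(w1, i, j),
            flip(w2, i, j), flip(v, i, j), flip(v, j), v]


def is_path(nodes):
    """True if consecutive nodes differ in exactly one bit."""
    return all(bin(a ^ b).count("1") == 1 for a, b in zip(nodes, nodes[1:]))


n = 8
base = [0b000, 0b001, 0b011, 0b111]        # u, w1, w2, v at Hamming distance 3
free = [d for d in range(n) if d > 2]      # dimensions not used along P
paths = {(i, j): detour_path(base, i, j) for i in free for j in free if i < j}

assert all(is_path(p) and len(p) == 8 for p in paths.values())
interiors = [set(p[2:6]) for p in paths.values()]
# Interior nodes never repeat across different pairs (i, j).
assert sum(len(s) for s in interiors) == len(set().union(*interiors))
```

Because the interiors are disjoint, whether one detour is live does not depend on the interior nodes of any other detour, which is what the proof relies on.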

Once we have established the existence of live paths it is a simple matter to find 
them algorithmically. However, if we look for them deterministically it is difficult to 
bound the congestion. A random algorithm which uniformly chooses a random live 
path for each pair of nodes is easier to analyze. Before we show how well the random 
algorithm performs, we prove a simple lemma about balls and boxes.

Lemma 3.12. If each of $\alpha$ balls is placed randomly and uniformly into one of $\beta$
boxes, then with probability $1 - (\alpha e/\beta\gamma)^\gamma$ there are fewer than $\gamma$ balls placed
in the first box.

Proof. The probability that there are more than $\gamma$ balls in the first box is no more
than
\[ \binom{\alpha}{\gamma} \left(\frac{1}{\beta}\right)^{\gamma}
   < \left(\frac{\alpha e}{\gamma}\right)^{\gamma} \left(\frac{1}{\beta}\right)^{\gamma}
   = \left(\frac{\alpha e}{\beta\gamma}\right)^{\gamma}. \quad ■ \]
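
As a sanity check on the balls-and-boxes lemma, the sketch below compares the exact binomial tail with the bound $(\alpha e/\beta\gamma)^\gamma$. The function names and the sample values of alpha, beta and gamma are illustrative assumptions, not taken from the text.

```python
from math import comb, e


def exact_tail(alpha, beta, gamma):
    """Probability that at least gamma of alpha balls land in the first of beta boxes."""
    q = 1 / beta
    return sum(comb(alpha, k) * q**k * (1 - q) ** (alpha - k)
               for k in range(gamma, alpha + 1))


def lemma_bound(alpha, beta, gamma):
    """The bound (alpha * e / (beta * gamma)) ** gamma from the lemma."""
    return (alpha * e / (beta * gamma)) ** gamma


alpha, beta, gamma = 200, 100, 12          # illustrative values only
assert exact_tail(alpha, beta, gamma) <= lemma_bound(alpha, beta, gamma)
```

The bound is only informative when $\alpha e < \beta\gamma$; otherwise it exceeds 1 and holds trivially.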

Theorem 3.13. If we uniformly choose a random live path between each pair of 
chosen nodes at the end of algorithm 3.2, then with high probability the resulting 
embedding will have congestion O(log N) and induced congestion O(log N/ log log N). 

Proof. The estimates will follow from lemma 3.12. The balls correspond to the edges 
of the paths of the embedding and the boxes are the edges of H n . The paths which 
potentially share a given edge can be separated into classes. We assign a path to a 
class depending on which position in the associated live path the edge would occupy 
if the live path were actually routed through the edge. We will then show that with 
high probability the congestion due to the live paths associated with any one class is 
$O(n)$. Since there will be only four classes, the result will follow.

Fix an edge $(s,t)$. Given a path $P$, put $P$ in class $r$ if $(s,t)$ is the $r$th edge along
$P^{ij}$ (reading from the closest end) for some pair $(i,j)$. There are four cases we need
consider.

$r = 1$: Then $s = u$. Since there are at most $n - 1$ paths beginning at $u$, there are
only $n - 1$ paths of this sort even in the worst case.

$r = 2$: Then $(s,t) = (u^i, u^{ij})$. There are $n - 1$ possible values for $u$, each an endpoint
of at most $n - 1$ paths $P$. Since there are at least $(\epsilon n)^2/2$ choices for $(i,j)$ for each of
these $O(n^2)$ paths, lemma 3.12 applies to show that the probability that more than
$O(\log N/\log\log N)$ of these paths are actually chosen to go through $(s,t)$ is at most
$O(N^{-k})$.

$r = 3$: Then $(s,t) = (u^{ij}, w_1^{ij})$. If $w_1 \in H'_{n-1}$ then the path $P$ was embedded for the
edge $(u, w_1)$. Thus only one path of this type exists for each pair $(i,j)$. If $w_1 \in H''_{n-1}$
then the path $P$ was embedded for an edge incident to $w_1$. Thus only $n - 1$ paths of
this type exist for each pair $(i,j)$. Therefore the total number of paths $P$ in this class
is no more than $n^3$ for any edge $(s,t)$. Again the probability that any one of these
paths will actually be chosen to go through $(s,t)$ is no more than $2/(\epsilon n)^2$. By lemma
3.12 the probability that more than $O(\log N)$ of these paths are chosen that way is
at most $O(N^{-k})$.

$r = 4$: Then $(s,t) = (w_1^{ij}, w_2^{ij})$. There are two cases. If both $w_1, w_2 \in H'_{n-1}$, then
$P$ was embedded for $(w_1, w_2)$. Thus only one path of this type exists for each pair
$(i,j)$. If $w_1 \in H'_{n-1}, w_2 \in H''_{n-1}$, then $P$ was embedded for an edge incident to $w_1$.
Thus only $n - 1$ paths of this type exist for each pair $(i,j)$. The rest of the analysis
is identical to that of the previous case.

Thus the probability that the congestion is more than $O(\log N)$ is at most
$O(N^{-k}) \times N \log N/2 = O(N^{-k+1} \log N)$.

To prove the induced congestion is $O(\log N/\log\log N)$, note that only one path
from class 1 can contribute to the induced congestion. Note also that classes 3 and
4 have only $O(n^2)$ paths in them which can contribute to the induced congestion,
since the original edges could not have been adjacent. Thus the analysis for induced
congestion due to classes 3 and 4 reduces to that of case 2 above. ■

3.5 An Algorithm for Constant Delay Embedding 

In section 3.4 we resorted to probabilistic means to find fault-free communication 
paths. We will use probabilistic methods again in this section, together with a more 
uniform view of the nodes of the cube. We allow the max load to rise to a constant, 
and in return we achieve constant congestion. 

To achieve a constant delay embedding, we need the load, dilation and congestion 
to all be constant. The embedding we will find will have a load and congestion which 
depend strongly on the probability of failure - clearly the more nodes that fail, the 
more nodes that have to be simulated by any one processor. However, the dilation 
will always remain five, and each processor will be simulated by one of its neighbors, 
provided that $p < 1 - \sqrt[4]{.5}$ (about .16).

In order to simplify the analysis, each node (live or dead) finds a neighbor to
simulate it. We first assign nodes to live neighbors so that no node simulates more
than a constant number of its neighbors. Then each pair of nodes simulating neighbors
finds a live path between them of length five so that no more than a constant number
of these paths congest any edge. We will use two similar algorithms to accomplish
these two tasks.

3.5.1 Assigning Nodes to Live Neighbors 

Let $A_p$ and $s_p$ be constants (to be determined later) which depend only upon the
probability $p$ of failure. Call a node unsaturated if it is live and if it has been assigned
to simulate fewer than $A_p$ of its neighbors. Otherwise, it is saturated.

The assignment algorithm proceeds in rounds. During a round, a previously
unsaturated node might be picked by enough unassigned nodes so as to exceed its capacity
$A_p$. In such a case, we require the node to accept enough of the simulation requests
to saturate it. Algorithm 3.3 performs the first phase.



for $i = 1$ to $s_p n$
    for each unassigned node $w$
        $w$ picks one of its neighbors uniformly
    each unsaturated node $v$ agrees to simulate as many nodes as it can
    without exceeding its capacity
    all excess nodes remain unassigned



Figure 3-8: Algorithm 3.3. 

Since the algorithm never assigns a saturated node to simulate another node, no
node simulates more than $A_p$ nodes. Thus, a constant load embedding results.
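
One possible reading of algorithm 3.3 is sketched below for small cubes. It is an illustration under assumptions rather than the thesis's implementation: nodes are integers, the function name `assign_to_neighbors` is hypothetical, `capacity` plays the role of $A_p$, `rounds` plays the role of $s_p n$, and ties among requests are broken by arrival order. The final assertions check only the load bound and that every simulator is a live neighbor; they do not claim that every node ends up assigned.

```python
import random


def assign_to_neighbors(n, live, capacity, rounds, rng):
    """Round-based assignment on the n-cube; returns {node: simulating neighbor}."""
    simulator = {}
    load = {v: 0 for v in live}
    for _ in range(rounds):
        requests = {}
        for w in range(1 << n):
            if w not in simulator:
                choice = w ^ (1 << rng.randrange(n))   # uniformly chosen neighbor
                requests.setdefault(choice, []).append(w)
        for v, askers in requests.items():
            if v in live:                              # dead nodes accept nothing
                room = capacity - load[v]
                for w in askers[:room]:                # excess requests stay unassigned
                    simulator[w] = v
                    load[v] += 1
    return simulator


rng = random.Random(0)
n = 6
live = {v for v in range(1 << n) if rng.random() > 0.1}   # illustrative fault pattern
result = assign_to_neighbors(n, live, capacity=4, rounds=10 * n, rng=rng)

loads = {}
for w, v in result.items():
    loads[v] = loads.get(v, 0) + 1
assert all(c <= 4 for c in loads.values())
assert all(v in live and bin(v ^ w).count("1") == 1 for w, v in result.items())
```

The load bound holds by construction, mirroring the observation that a saturated node is never asked to simulate another node.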

To facilitate our proofs, we will first formulate a sequential algorithm similar 
to algorithm 3.3. We will prove that this new algorithm assigns to each node a 
neighboring node to simulate it. We will then show that, except for a small proportion 
of executions, the algorithms behave the same. 

for $i = 1$ to $s_p n$
    for unassigned nodes $w$ in arbitrary order
        if $w$ has fewer than $\alpha_p n$ unsaturated neighbors
            arbitrarily dedicate enough (saturated) neighbors
        $w$ picks one of its neighbors uniformly
        if the chosen node is unsaturated or dedicated
            $w$ is assigned to that node
        else $w$ remains unassigned

Figure 3-9: Algorithm 3.4.

In each round of algorithm 3.4, unassigned nodes act sequentially. Each node
chooses a neighbor to simulate it only after all lower ordered nodes have chosen. We
would like to ensure that all nodes have a large number of choices that will result in a
successful assignment. Let $\alpha_p$ depend only upon the probability $p$. If some node $w$ has
fewer than $\alpha_p n$ unsaturated neighbors to choose from during its turn, we designate
an arbitrary set of saturated neighbors as dedicated to $w$ during its turn. If $w$ chooses
a dedicated node during that particular turn, the dedicated node agrees to simulate
$w$ even though it is saturated. We dedicate enough nodes so that $w$ has at least $\alpha_p n$
neighbors which, if chosen, will agree to simulate it.

We will show below that with high probability no nodes are ever dedicated during 
algorithm 3.4. In that case, the result is the same whether unassigned nodes choose 
sequentially or in parallel. Thus we will show that algorithms 3.3 and 3.4 produce 
the same output. 

The following lemma proves that algorithm 3.4 terminates quickly. 

Lemma 3.14. With high probability all nodes have been assigned after $s_p n$ steps of
algorithm 3.4, for sufficiently large $s_p$.

Proof. Because each node always has at least $\alpha_p n$ neighbors which will simulate it if
chosen, the probability that a given node is assigned during some step is at least $\alpha_p$,
regardless of what has occurred in previous steps. Thus the probability that a node
remains unassigned after $s_p n$ steps is no more than $(1 - \alpha_p)^{s_p n}$. This quantity is less
than $N^{-k}$ as long as $s_p > k/\alpha_p$. ■
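
The last step of this proof can be spelled out as follows. The chain below is a supplied derivation, not quoted from the original; it uses the standard inequality $1 - x \le e^{-x}$ and $N = 2^n$.

```latex
\[
(1-\alpha_p)^{s_p n} \;\le\; e^{-\alpha_p s_p n} \;<\; e^{-kn} \;<\; 2^{-kn} \;=\; N^{-k}
\qquad \text{whenever } s_p > k/\alpha_p .
\]
```

A union bound over the $N$ nodes then leaves a failure probability below $N^{-k+1}$, so any $k > 1$ suffices for a high-probability statement.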

Lemma 3.15. For $p < 1 - \sqrt[4]{.5}$, there exists an $\epsilon_p$ and a constant $c_{13} > 0$ such that
with probability at least $1 - N^{-c_{13}}$ each node has at least $\epsilon_p n$ live neighbors.

Proof. The probability that a node has fewer than $\epsilon n$ live neighbors equals

\[ \sum_{i=0}^{\epsilon n} \binom{n}{i} (1-p)^i p^{n-i} \]

Since the ratio of consecutive terms is always greater than $(1-p)/p$, this sum is
bounded by a constant times its last term. That term is

\[ \binom{n}{\epsilon n} (1-p)^{\epsilon n} p^{(1-\epsilon)n} < \binom{n}{\epsilon n} p^{(1-\epsilon)n} \]

The second term in the product can be made less than $N^{-1-c_{14}}$ for some $c_{14}$ by taking
$\epsilon$ small enough. The first term in the product can be made less than $N^{c_{14}/2}$ by taking
$\epsilon$ small enough as well. The probability that some node has too few neighbors is
bounded by the sum of the probabilities for the individual nodes. This multiplies
the above bound by $N$. Thus for any $\epsilon$ below both of these thresholds, the theorem
applies. ■

The following two lemmas show that with high probability algorithm 3.4 never 
dedicates saturated nodes. Thus with high probability algorithms 3.3 and 3.4 behave 
identically. This proves that algorithm 3.3 assigns all nodes with high probability. 
Similar reasoning proves the Dance Hall Theorem described in the introduction. 

Lemma 3.16. Given a failure rate $p$, assume that every node has at least $\epsilon_p n$ live
neighbors. Then with high probability a given node $v$ never has fewer than $\alpha_p n$
unsaturated neighbors available during algorithm 3.4, for $\alpha_p = \epsilon_p/2$.

Proof. For $v$ to have fewer than $\alpha_p n$ unsaturated neighbors at some point during
algorithm 3.4, at least $(\epsilon_p - \alpha_p)n = \alpha_p n$ of $v$'s neighbors must have become saturated
during the course of the algorithm.

Each node always has at least $\alpha_p n$ neighbors (including dedicated nodes) to which
it might be assigned during any step. Further, if it is assigned, it is equally likely to
be assigned to any one of those neighbors. Thus no node has a probability greater
than $1/\alpha_p n$ that it will be assigned to any given neighbor, no matter what other
assignments have been made previously.

To saturate $\alpha_p n$ of $v$'s neighbors, there must be at least $A_p \alpha_p n$ nodes at Hamming
distance two from $v$ each of which is assigned to a neighbor of $v$. There are no more
than $n^2$ nodes which might be assigned to some node in $v$'s neighborhood. Each
one of these nodes has at most two neighbors of $v$ to which it might be assigned.
Although the probabilities of such selections are dependent, the probability a given
node is assigned to a neighbor of $v$ is at most $2/\alpha_p n$, no matter what choices the other
nodes made. The probability that at least $\alpha_p n$ of $v$'s neighbors become saturated is
thus no more than

\[ \binom{n^2}{A_p \alpha_p n} \left( \frac{2}{\alpha_p n} \right)^{A_p \alpha_p n}
   < \left( \frac{2e}{A_p \alpha_p^2} \right)^{A_p \alpha_p n} \]

For $A_p$ large enough, this quantity is an inverse polynomial in $N$. ■
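
The inequality behind this bound follows from the standard estimate $\binom{a}{b} \le (ea/b)^b$, applied with $a = n^2$ and $b = A_p \alpha_p n$. The steps below are a supplied worked derivation, and the particular threshold $A_p \ge 4e/\alpha_p^2$ is an illustrative choice, not a value fixed by the text.

```latex
\[
\binom{n^2}{A_p\alpha_p n}\left(\frac{2}{\alpha_p n}\right)^{A_p\alpha_p n}
\le \left(\frac{e\,n^2}{A_p\alpha_p n}\right)^{A_p\alpha_p n}
    \left(\frac{2}{\alpha_p n}\right)^{A_p\alpha_p n}
= \left(\frac{2e}{A_p\alpha_p^2}\right)^{A_p\alpha_p n}.
\]
\[
\text{If } A_p \ge \frac{4e}{\alpha_p^2}, \text{ then }
\left(\frac{2e}{A_p\alpha_p^2}\right)^{A_p\alpha_p n} \le 2^{-A_p\alpha_p n} = N^{-A_p\alpha_p}.
\]
```

Under that choice the exponent $A_p\alpha_p$ grows with $A_p$, so the failure probability can be pushed below any fixed inverse polynomial in $N$.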

Lemma 3.16 implies that with high probability algorithms 3.3 and 3.4 behave
identically. We know that algorithm 3.4 successfully assigns each node to a neighbor
with high probability and that algorithm 3.3 never assigns more than $A_p$ nodes to
any node. We conclude that algorithm 3.3 achieves a constant load embedding with
high probability.

3.5.2 Assigning Edges to Paths 

Once we've assigned simulating nodes, we need to find paths to simulate the edges
in the hypercube. Say that $v^b$ simulates $v$ and $v^{kb'}$ simulates $v^k$. Then to simulate
the edge $(v, v^k)$, the nodes $v^b$ and $v^{kb'}$ choose a path between them of the form
$P(v, v^k, b, b', r) = (v^b, v^{br}, v^r, v^{rk}, v^{rkb'}, v^{kb'})$. To avoid ambiguity, we will refer to the
choice of $r$ as if it were made by $v$ and $v^k$ even though $v^b$ and $v^{kb'}$ actually choose.

For two adjacent nodes $v$ and $v^k$, let $S(v, v^k, b, b')$ be the set of dimensions $r \ne k$
for which $P(v, v^k, b, b', r)$ is a live path. Because $p < 1 - \sqrt[4]{.5}$, there is a chance
$(1-p)^4 = s > 1/2$ that any given path $P(v, v^k, b, b', r)$ is live. Note that the paths
$P(v, v^k, b, b', r)$ ($r \ne k$) are node-disjoint for a fixed choice of $v$, $v^k$, $b$ and $b'$. Thus the
probability that any one of them is live is independent of the other paths.
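
The five-edge simulating paths can be generated and checked directly. The sketch below makes several assumptions for illustration: nodes are integers, superscripts are bit flips, the names `flip`, `simulating_path` and `is_path` are hypothetical, `b2` stands for $b'$, and the example additionally excludes $r = b$ and $r = b'$ (a simplification beyond the requirement $r \ne k$ stated in the text). It also evaluates the threshold on $p$ at which four interior nodes are all live with probability above 1/2.

```python
def flip(x, *dims):
    """Flip the listed bit positions (dimensions) of node x."""
    for d in dims:
        x ^= 1 << d
    return x


def simulating_path(v, k, b, b2, r):
    """Nodes of the path from v^b to v^{k b'} through dimension r (b2 stands for b')."""
    return [flip(v, b), flip(v, b, r), flip(v, r),
            flip(v, r, k), flip(v, r, k, b2), flip(v, k, b2)]


def is_path(nodes):
    """True if consecutive nodes differ in exactly one bit."""
    return all(bin(x ^ y).count("1") == 1 for x, y in zip(nodes, nodes[1:]))


n, v, k, b, b2 = 8, 0, 0, 1, 2             # illustrative choice of v, k, b, b'
candidates = [r for r in range(n) if r not in (k, b, b2)]
paths = [simulating_path(v, k, b, b2, r) for r in candidates]

assert all(is_path(p) and len(p) == 6 for p in paths)
interiors = [set(p[1:5]) for p in paths]
# The four interior nodes of each path are disjoint from those of every other path.
assert sum(len(s) for s in interiors) == len(set().union(*interiors))

# (1 - p)^4 > 1/2 exactly when p is below this threshold.
threshold = 1 - 0.5 ** 0.25
```

The endpoints are the simulating nodes and are live by construction, so only the four interior nodes matter for liveness, which is where the factor $(1-p)^4$ comes from.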

Lemma 3.17. With high probability, for all quadruples $(v, v^k, b, b')$,
$|S(v, v^k, b, b')| > \eta_p n$ for some constant $\eta_p$.

Figure 3-10: A choice of live path.

Proof. Same as lemma 3.15, except that there are $N \log^3 N$ different quadruples. ■

With high probability, we know that all pairs of neighbors have many paths from 
which to choose. What remains is for them to decide in a systematic but local fashion 
how to choose from among these paths without congesting any edge too much. In 
the rest of this section, we explore a way to choose paths in this manner. 

Take a node $v$ simulated by its neighbor $v^b$ and consider the set $E_{v,b}$ of edges
$\{(v^{br}, v^r)\}$. There are $2n^2$ nodes $w$ (all of the form $w = v^{rt}$ or $w = v^{brt}$) which (like
$v$) might potentially use one of the edges in the set as a second edge along a path.
Any node which actually does must be simulated by its neighbor across dimension $b$.
The next lemma bounds the number of such nodes.

Lemma 3.18. For sufficiently large $\delta_p$ and with high probability, of the $2n^2$ nodes
at distance 0 or 2 from either $v$ or $v^b$, no more than $\delta_p n$ of them are simulated by
neighbors across dimension $b$.

Proof. As noted before, each node has a probability no more than $1/\alpha_p n$ of borrowing
across any given dimension, regardless of the choices made by other nodes. The
probability that many nodes choose across the same dimension is no more than

\[ \binom{2n^2}{\delta_p n} \left( \frac{1}{\alpha_p n} \right)^{\delta_p n} \]

Of course, the actual probabilities depend on the particular $\delta_p n$-size subset we
consider and on the relative order in which the nodes of the subset successfully found
neighbors to simulate them. Then any node's probabilities are conditioned upon
other nodes' previous choices. No matter how these choices are made, however, the
stated probabilities are upper bounds on the actual probabilities since when each
node chooses it always has at least $\alpha_p n$ choices.

For sufficiently large $\delta_p$, this is smaller than an inverse polynomial in $N$. ■

Each of the at most $\delta_p n$ nodes (except for $v$ and $v^b$) can use at most two edges
in the set $E_{v,b}$ as a second edge along some path. To use an edge as a second edge,
such a node would have to be a neighbor of one of the nodes incident to the edge. If
$w$ is of the form $w = v^{rt}$ then $w$ is adjacent to $v^r$ and $v^t$ and no other node incident
to an edge in $E_{v,b}$. Similar reasoning applies to nodes $w$ which satisfy $w = v^{brt}$.
Trivially, each of $v$ and $v^b$ can use no more than $n$ edges of $E_{v,b}$ as a second edge
along some path. If we sum over all edges in $E_{v,b}$ the number of nodes which can use
each edge as a second edge, counting according to multiplicity, the total will be no
more than $(2\delta_p + 2)n$. Therefore no more than $\eta_p n/4$ of these edges will have more
than $\gamma_p = 4(2\delta_p + 2)/\eta_p$ of those $\delta_p n$ nodes potentially using them as second edges.
Let $S'(v,b) = \{r \mid$ more than $\gamma_p$ nodes can send a path through the edge $(v^{br}, v^r)\}$.
Then $|S'(v,b)| < \eta_p n/4$.

Let $T(v, v^k, b, b') = S(v, v^k, b, b') - S'(v,b) - S'(v^k, b')$. Then for each adjacent
pair of nodes $v$ and $v^k$, $|T(v, v^k, b, b')| > \eta_p n/2$. The sets $T(v, v^k, b, b')$ will be crucial
for our reasoning. The probability that a pair successfully choose a path between
them is lower bounded by the probability that they successfully choose the path from
$T(v, v^k, b, b')$.
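
The two counting steps used here can be written out explicitly. In the supplied derivation below, $c_r$ is a hypothetical name (not in the original) for the number of nodes that can use the edge $(v^{br}, v^r)$ as a second edge.

```latex
\[
\gamma_p \,|S'(v,b)| \;<\; \sum_{r \in S'(v,b)} c_r \;\le\; \sum_{r} c_r \;\le\; (2\delta_p + 2)\,n
\quad\Longrightarrow\quad
|S'(v,b)| \;<\; \frac{(2\delta_p+2)\,n}{\gamma_p} \;=\; \frac{\eta_p n}{4}.
\]
\[
|T(v,v^k,b,b')| \;\ge\; |S(v,v^k,b,b')| - |S'(v,b)| - |S'(v^k,b')|
\;>\; \eta_p n - \frac{\eta_p n}{4} - \frac{\eta_p n}{4} \;=\; \frac{\eta_p n}{2}.
\]
```

The first line is an averaging argument with $\gamma_p = 4(2\delta_p+2)/\eta_p$; the second combines it with lemma 3.17.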

Note that among the edges in all the paths represented by the sets $T(v, v^k, b, b')$,
there are now only a logarithmic number of quadruples $(w, w^l, c, c')$ which might
potentially congest any given edge. We've already limited the number of paths for
which the edge is the second edge along the path. If the edge is the first edge along
the path, then one of the edge's endpoints is the simulating node. Each endpoint
simulates only a constant number of nodes, and each simulated node contributes
exactly $n$ paths. If the edge is the third edge along the path, then the path is
simulating an edge at Hamming distance one from the edge considered. There are
exactly $n$ edges of this type. The cases in which the edge is the fourth or fifth edge
along the path are identical to the first two cases. Thus each edge can be potentially
congested by no more than $\pi_p n = (4A_p + 2\gamma_p + 1)n$ paths.

We can now describe algorithm 3.5, which assigns paths to simulate edges. During 
algorithm 3.5, each edge will decide whether or not to accept some path routed 
through it. Because the other edges in the path simultaneously decide whether or 
not to accept the path, it is possible that some might accept it while others reject it. 
If this happens, we assume that an accepting edge counts the path as contributing 
to its load anyway. Call an edge saturated if it has accepted exactly $B_p$ paths routed
through it. Otherwise, call it unsaturated. Order the pairs $(v, v^k)$ lexicographically.
As before, in any round an edge accepts an arbitrary set of pairs which try to route 
through it until it reaches its capacity. 

for $i = 1$ to $s'_p n$
    for each unassigned adjacent pair of nodes $(v, v^k)$
        $(v, v^k)$ pick a path between them uniformly
    each unsaturated edge agrees to as many paths routed through as it can
    without exceeding its capacity, deciding conflicts arbitrarily
    all excess pairs remain unassigned



Figure 3-11: Algorithm 3.5. 
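
The round structure of algorithm 3.5 can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation analyzed in this chapter: the names `assign_paths`, `candidates` and `capacity` are hypothetical, paths are tuples of opaque edge identifiers, a single integer `capacity` stands in for $B_p$, and "deciding conflicts arbitrarily" is modeled as first come, first served.

```python
import random
from collections import defaultdict


def assign_paths(candidates, capacity, rounds, rng):
    """Run up to `rounds` rounds of random path selection.

    candidates: dict mapping each pair to a list of candidate paths,
                where a path is a tuple of hashable edge identifiers.
    capacity:   number of paths an edge may accept (stands in for B_p).
    Returns (assigned, load): chosen path per pair, accepted count per edge.
    """
    load = defaultdict(int)
    assigned = {}
    for _ in range(rounds):
        unassigned = [pair for pair in candidates if pair not in assigned]
        if not unassigned:
            break
        # Every unassigned pair picks one candidate path uniformly at random.
        choice = {pair: rng.choice(candidates[pair]) for pair in unassigned}
        requests = defaultdict(list)
        for pair, path in choice.items():
            for edge in path:
                requests[edge].append(pair)
        # Each edge decides on its own, so some edges of a path may accept
        # while others reject; an accepting edge still counts the path.
        accepted = {}
        for edge, pairs in requests.items():
            room = capacity - load[edge]
            accepted[edge] = set(pairs[:room])
            load[edge] += len(accepted[edge])
        # A pair is assigned only if every edge of its path accepted it.
        for pair, path in choice.items():
            if all(pair in accepted[edge] for edge in path):
                assigned[pair] = path
    return assigned, dict(load)


demo = {"a": [("e1", "e2")], "b": [("e2", "e3")], "c": [("e4",)]}
assigned, load = assign_paths(demo, capacity=1, rounds=5, rng=random.Random(0))
```

In the demo, pair `b` loses edge `e2` to pair `a` in the first round while `e3` still counts `b` toward its load, which is the pessimistic accounting described in the text.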

Paralleling what we did before, we will present algorithm 3.6, a sequential version
of algorithm 3.5. We will show that this modified algorithm terminates having
assigned paths between every pair of nodes simulating neighbors, with high probability.
Maintaining the parallel with what we proved earlier in this section, we will
then show that the two algorithms perform indistinguishably, with high probability.
At any time when the pair $(v, v^k)$ attempts to choose a path between them during
algorithm 3.6, let $U(v, v^k, b, b')$ be the subset of $T(v, v^k, b, b')$ consisting of dimensions
$r$ for which all of the edges along $P(v, v^k, b, b', r)$ are unsaturated. Define the dedication
of a path containing a saturated edge in a fashion similar to the dedication
of saturated neighbors before. We dedicate paths to the pair $(v, v^k)$ whenever $\beta_p n$
choices for a simulating path do not exist.

for i = 1 to $s'_p n$
    for all unassigned pairs $(v, v^k)$ in arbitrary order
        if $|U(v, v^k, b, b')| < \beta_p n$
            dedicate enough $r \in T(v, v^k, b, b')$
        $(v, v^k)$ picks a path between them uniformly
        if the chosen path is unsaturated or dedicated
            $(v, v^k)$ is assigned to the path
        else $(v, v^k)$ remains unassigned



Figure 3-12: Algorithm 3.6. 

Lemma 3.19. For a suitably large choice of the constant $s'_p$, with high probability
all pairs of nodes searching for an assignment to a path have been assigned one after
$s'_p n$ steps of algorithm 3.6.

Proof. Each pair is successfully assigned with probability at least $\beta_p$ during any step.
The rest of the proof is identical to that of lemma 3.14. ■

We now show that with high probability algorithm 3.6 never adds dedicated paths
with saturated edges to any $U(v, v^k, b, b')$. Thus with high probability algorithms 3.5
and 3.6 behave identically. This proves that algorithm 3.5 assigns all necessary paths
with high probability.

Lemma 3.20. With high probability no set $U(v, v^k, b, b')$ ever has cardinality less
than $\beta_p n$ at the beginning of some step of algorithm 3.6, given $\beta_p = \eta_p/4$.

Proof. There are at most $\mu_p n$ pairs which have a non-zero probability of congesting
a given edge on some path represented by an $r \in T(v, v^k, b, b')$. Thus at most $5\mu_p n^2$
pairs have non-zero probability of congesting any of those edges, counting according to
multiplicity. For a path to leave $U(v, v^k, b, b')$ one of its edges must become saturated.
For $(\eta_p/2 - \beta_p)n = \beta_p n$ paths to become unavailable, $B_p \beta_p n$ pairs must choose a path
crossing an edge on some path represented by an $r \in T(v, v^k, b, b')$.
The probability that a pair chooses any particular path is at most $1/(\beta_p n)$, no matter
what other choices are made. Thus if there are $q_{w,w^j}$ paths that a particular pair
$(w, w^j)$ might choose which contain an edge on some path in $T(v, v^k, b, b')$, then the
probability that $(w, w^j)$ chooses such a path is at most $q_{w,w^j}/(\beta_p n)$, and
$\sum_{w,w^j} q_{w,w^j} \le 5\mu_p n^2$.

By a moment generating function argument similar to those in lemmas 2.1 and 2.7
and in theorem 2.13, the probability that more than $\beta_p n$ paths become unavailable
is therefore no more than $O(N^{-k})$ for arbitrary $k$. ■

With high probability $O(n)$ steps are sufficient to select all paths. Since we have
guaranteed that the paths have constant congestion, this proves the following theorem.

Theorem 3.21. For each $p < 1 - \sqrt[4]{1/2}$ (about .16) there exist an $\epsilon_p$ and an $\eta_p$
such that with probability $1 - N^{-c_{11}}$, at least $\epsilon_p n$ neighbors of every node are live
and $|S(v, v^k, b, b')| \ge \eta_p n$ for all quadruples $(v, v^k, b, b')$. Given that these facts hold, there
is an $O(\log N)$ step algorithm which with high probability finds an embedded fully
functioning $N$-node cube in $H_n - F$ with constant load, dilation and congestion. The
paths which simulate the edges of the cube only use live nodes.

3.6 Implementing the Constant Delay Embedding 

As given so far, the algorithms of the previous section are far from implementable. 
Each node needs to know information about which nodes have decided to simulate 
which other nodes, which paths it may route through, whether or not certain tentative 
assignments have been finalized, and so forth. In this section we will show how such 
information might be exchanged in polylogarithmic time per step. This implies that 
the embedding of the previous section is obtainable in polylogarithmic time. 

Focus on any particular node v. Because v might be faulty, one of its neighbors 
must choose a simulating node for it. Arbitrarily, we will use the lexicographically 
smallest labelled live neighbor to simulate v during the course of algorithm 3.3. First, 
the neighbors must agree on which one of them is the lowest. During any step of 
algorithm 3.3, that neighbor of v must inform all the other neighbors which one
of them v selected during that step. Both of these operations are trivial once we 
understand how a node's neighbors can communicate even with faults. 

Each node $v^i$ broadcasts to $v$'s other neighbors by first broadcasting to all of its
neighbors. Then each node $v^{ij}$ passes the information to its unique other neighbor
which is also a neighbor of $v$, the node $v^j$. A picture of this type of broadcasting
appears in figure 3-13. 

Figure 3-13: Broadcasting to other neighbors.
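
A small Python sketch may make the relay pattern concrete. It assumes that hypercube nodes are labelled by integers and that the neighbor across dimension $i$ is obtained by flipping bit $i$; the helper names `neighbor` and `informed_by` are illustrative only and do not come from the text.

```python
def neighbor(v, *dims):
    """Flip the given dimensions of node label v (gives v^i, v^{ij}, ...)."""
    for d in dims:
        v ^= 1 << d
    return v


def informed_by(v, i, n, faulty):
    """Dimensions j such that v^j hears one broadcast of v^i via v^{ij}.

    The message from v^i reaches v^j only if both the intermediary v^{ij}
    and the destination v^j are live (not in `faulty`).
    """
    reached = set()
    for j in range(n):
        if j == i:
            continue
        relay, target = neighbor(v, i, j), neighbor(v, j)
        if relay not in faulty and target not in faulty:
            reached.add(j)
    return reached
```

For example, in a 3-cube with $v = 0$ and $i = 0$, marking the intermediary node 3 as faulty leaves only the neighbor across dimension 2 informed after a single broadcast.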

We only care if the message gets through to the other neighbors of v which are live. 
The broadcast we have described sends the messages through a set of intermediary 
nodes, several of which are likely to be faulty. Thus if each node broadcasts just once, 
we might expect that several nodes will not receive the information they need. We 
remedy this problem by allowing each node to broadcast its information and then 
repeating the broadcast twice more. With probability $1 - N^{-c_{19}}$, every neighbor of $v$
is informed of the activity of all of $v$'s other neighbors. We prove this scheme works
by showing that with probability $1 - N^{-c_{16}}$, for every node $v$ of the hypercube, every
live pair $v^i$ and $v^j$ of $v$'s neighbors is connected by a live path consisting of two or
three broadcasts.

Lemma 3.22. With probability $1 - N^{-c_{16}}$, between every pair of live neighbors
$v^i$, $v^j$ of every node $v$ there is a live path of the form $(v^i, v^{ik}, v^k, v^{kj}, v^j)$ or of the form
$(v^i, v^{ik}, v^k, v^{kl}, v^l, v^{lj}, v^j)$.

Proof. Consider a node $v$ and a neighbor $v^i$. The neighbor $v^i$ successfully broadcasts
to another neighbor $v^k$ exactly when $v^k$ and the intermediary node $v^{ik}$ are both
live. Since the probability of failure is no more than $1 - \sqrt[4]{1/2}$, the probability that
one or both of these nodes are dead is no more than $1 - \sqrt{1/2}$. Further, none of these
pairs $\{v^k, v^{ik}\}$ share a common node. Thus there is a $\rho > 0$ and a $c_{17} > 0$ such that
with probability $1 - N^{-c_{17}}$ at least $\rho n$ of the pairs will be live, for all neighbors $v^i$ of $v$.

Next take two disjoint sets $S_1$ and $S_2$ each containing $\rho n$ neighbors of a given
node $v$ and consider the set $\Gamma(S_1, S_2) = \{v^{kl} \mid v^k \in S_1, v^l \in S_2 \text{ and } v^{kl} \text{ is live}\}$. Since
there are $\rho^2 n^2$ pairs of nodes $v^k$, $v^l$ which satisfy the first two requirements and each
pair has a constant probability that it satisfies the last requirement (independent of
other nodes), with high probability the set $\Gamma(S_1, S_2)$ is nonempty. There are no more
than $N^2$ ways to choose the sets $S_1$ and $S_2$ for any given node $v$. Thus with high
probability the set $\Gamma(S_1, S_2)$ is nonempty for all choices of $S_1$ and $S_2$. Since there are
only $N$ choices for $v$, with high probability for each node $v$ and each choice of $S_1$ and
$S_2$, the set $\Gamma(S_1, S_2)$ is nonempty.

With probability $1 - N^{-c_{16}}$ (for any $0 < c_{16} < c_{17}$), the conclusions of the first
two paragraphs hold. For a given node $v$, let $V_1 = \{v^k \mid v^{ik} \text{ and } v^k \text{ are both live}\}$ and
$V_2 = \{v^k \mid v^{jk} \text{ and } v^k \text{ are both live}\}$. Then if $|V_1 \cap V_2| \ne 0$, a path of length four
connects $v^i$ and $v^j$. If $|V_1 \cap V_2| = 0$ then, because some $\Gamma(S_1, S_2)$ with $S_1 \subseteq V_1$ and
$S_2 \subseteq V_2$ is nonempty, $v^i$ and $v^j$ are connected by a path of length six. ■
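
The case analysis at the end of the proof can be turned into a short search. The Python sketch below is an illustration under assumptions (integer node labels, bit flips for neighbors, a hypothetical function name `live_relay_path`); it also assumes that $v^i$ and $v^j$ themselves are live, as the lemma does.

```python
def flip(v, *dims):
    """Flip the given dimensions of node label v."""
    for d in dims:
        v ^= 1 << d
    return v


def live_relay_path(v, i, j, n, faulty):
    """Return a live path from v^i to v^j of length four or six, or None."""
    def live(x):
        return x not in faulty

    others = [k for k in range(n) if k not in (i, j)]
    # V1 and V2 as in the proof: relays reachable from v^i and from v^j.
    V1 = [k for k in others if live(flip(v, k)) and live(flip(v, i, k))]
    V2 = [k for k in others if live(flip(v, k)) and live(flip(v, j, k))]
    common = set(V1) & set(V2)
    if common:
        k = min(common)  # two broadcasts: path of length four
        return [flip(v, i), flip(v, i, k), flip(v, k), flip(v, k, j), flip(v, j)]
    for k in V1:  # three broadcasts: path of length six through v^{kl}
        for l in V2:
            if live(flip(v, k, l)):
                return [flip(v, i), flip(v, i, k), flip(v, k), flip(v, k, l),
                        flip(v, l), flip(v, l, j), flip(v, j)]
    return None
```

Each consecutive pair of nodes on a returned path differs in exactly one bit, so the path uses only hypercube edges.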

Before algorithm 3.5 can route the simulating paths, each node must know which 
nodes simulate the neighbors of the nodes it simulates. At the end of algorithm 3.3 
each live node knows which nodes simulate each of its neighbors. At least $\rho n$ neighbors
of every node are live. For every pair of neighbors $v$ and $v^k$, we only need some live
neighbor of $v$ to communicate with some live neighbor of $v^k$. Each neighbor $v^i$ of $v$
attempts to route along the path $v^i, v^{ij}, v^{ijk}, v^{jk}$ to each neighbor $v^{jk}$ of $v^k$. We are
only interested in the $(\rho n)^2$ paths which begin and end at live nodes. Each of these
paths intersects at most one other and with high probability one of the $(\rho n)^2/2$ node-disjoint
paths will be nonfaulty. Since there are only a polynomial number of pairs of
neighbors, each with only a polynomial number of possible sets of live neighbors, this
communication will be possible with high probability. A series of three broadcasts by
all nodes will accomplish the task in $O(\log^3 N)$ steps.

Finally, we will describe how to implement a slight variant of algorithm 3.5. Assume
that $v$ has an even number of 1's in its bit-vector representation. To avoid
confusion, to find a simulating path for the edge $(v, v^k)$, only the node $v^b$ simulating
$v$ will actually choose a path. Say that during a step of the algorithm, instead
of choosing a random dimension in $U(v, v^k, b, b')$, $v^b$ chooses a random dimension
from $\{1, 2, \ldots, n\}$. Then we know (1) the probability that $v^b$ chooses any particular
$r \in U(v, v^k, b, b')$ does not increase, (2) all sets $U(v, v^k, b, b')$ have cardinality
at least $\beta_p \log N$ with at least the same probability as in algorithm 3.5 and (3) each node $v^b$
has probability at least $\beta_p$ of choosing an $r \in U(v, v^k, b, b')$.

During any step of this modified algorithm, all of the nodes that have chosen
an $r \in U(v, v^k, b, b')$ succeed with at least the probability stated in the analysis of
algorithm 3.5. All other nodes may or may not find an unsaturated path and may
or may not encounter too much congestion. With high probability, if we run the
modified algorithm $2/\beta_p$ times as long as algorithm 3.5, each node $v^b$ will choose an
$r \in U(v, v^k, b, b')$ at least as many times as it did in algorithm 3.5. Thus, even if
nodes never find simulating paths except when they choose an $r \in U(v, v^k, b, b')$, all
nodes will find the necessary paths at least as successfully as before.

Each node that chooses a path attempts to route a message describing the path 
along the path. Any node along the path can send messages back if it detects too 
much congestion along one of its edges. If the message comes back, the even node 
knows that it was unsuccessful. Otherwise, both the even node and the odd node 
which is the message's destination know which path to use in the future. 

3.7 Extensions and Remarks

As mentioned in the introduction, edge faults are easily handled once node faults are 
understood. Say each edge fails with probability $p_e$, each node fails with probability
$p_n$ and the failure of any component is independent of the failure of other components.
Then all results still follow with little change. Specifically, as long as
$p_n + p_e - p_n p_e < 1 - \sqrt[5]{1/2}$ (about .13), the algorithms of sections 3.5 and 3.6 work with high probability.
The only addition to our reasoning is that when one node tries to communicate with 
a neighbor node, it is unsuccessful not only if the neighbor is faulty but also if the 
link between them has failed. 
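
The combined quantity is the probability that a neighbor or the link leading to it has failed, namely $1 - (1 - p_n)(1 - p_e)$. The short Python sketch below checks a candidate pair of failure rates against the stated bound. It assumes the threshold is $1 - (1/2)^{1/5}$, which agrees with the quoted value of about .13; the names `combined_failure` and `THRESHOLD` are illustrative.

```python
def combined_failure(p_node, p_edge):
    """Probability that a neighbor or the link to it has failed.

    Equals 1 - (1 - p_node) * (1 - p_edge) for independent failures.
    """
    return p_node + p_edge - p_node * p_edge


# Assumed threshold: 1 - (1/2)^(1/5), roughly 0.13.
THRESHOLD = 1 - 0.5 ** (1 / 5)


def within_threshold(p_node, p_edge):
    """True when the combined failure rate is below the assumed bound."""
    return combined_failure(p_node, p_edge) < THRESHOLD
```

For instance, node and edge failure rates of 5 percent each combine to 9.75 percent, which is below the bound, while 10 percent each combine to 19 percent, which is not.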

This work extends to the case in which $p$ is small; that is, if $p < N^{\alpha}$ for $0 > \alpha > -1$.
In this case, faults are so few and far between that the results of the second section can be
strengthened. The deterministic algorithm 3.1 achieves a constant delay embedding
with high probability. This result follows directly from the following fact.

Lemma 3.23. If faults occur with probability $p$ for small $p$ then with high probability
no sphere of radius 14 contains more than a constant number of faults.

Proof. Say $p < N^{\alpha}$. Then the probability that $m$ nodes out of any given $n^{14}$ nodes
are faulty is no more than

\[ \binom{n^{14}}{m} N^{\alpha m} < N^{\alpha m} n^{14m}. \]

There are at most $N$ such spheres to consider. If $m > -1/\alpha$, the total probability
that some sphere contains $m$ faults is an inverse polynomial whose exponent can be
diminished by increasing $m$. ■

Each simulating node only needs to distribute its connections among the dilation 
7 fault-free paths discovered in section 3.4. 

Last, a word about the practicality of the results of this chapter. We have made 
little attempt to optimize constants since we need large constants to obtain the full 
breadth of our results. However, in practice the full strength of these theorems will 
probably be unnecessary. We cannot expect half the processors in a network to fail as 
a normal occurrence. We are optimistic that when the number of faults is moderate,

Embedding Trees Dynamically



4.1 Introduction 

Achieving high performance on a parallel computer requires the satisfaction of two 
potentially conflicting requirements. First, the computational load posed by the 
program should be evenly shared among all processors (load balancing). Second, 
processes communicating frequently should be placed on processors that are close 
(communication locality). 

This problem has been studied abstractly as the problem of embedding a pro- 
cess graph G in a processor graph H ([BCHLR], [BCLR], [BI], [C], [GHR], [HJ], 
[KLMRR]). The vertices of G are processes comprising the parallel program, with 
edges representing communication between processes. The vertices of H are proces- 
sors, and the edges represent communication channels. For many computations, it is 
possible to predict G before execution. In such cases it is useful to map the vertices 
of G into those of H so as to minimize load, dilation and congestion. 

This chapter focuses on embedding arbitrary binary trees into the butterfly and hy- 
percube networks. Trees arise naturally in many computations: divide-and-conquer 
algorithms, branch-and-bound search ([KZ]), functional expression evaluation, and 
image understanding (quad/oct trees). In [BCLR], Bhatt et al. showed that every
$N$-node binary tree could be embedded in an $N$-processor hypercube such that each
processor received a single tree node, and the maximum dilation was $O(1)$. Embedding
trees into butterfly networks is harder, because the butterfly is much sparser
than the hypercube. In [BCHLR], Bhatt et al. showed how to embed the complete 
binary tree with N nodes in a butterfly network with N processors with constant 
dilation and load. The problem of embedding arbitrary trees into butterfly networks 
was left open. 

Tree structured computations are often dynamic. As the computation progresses, 
the tree may grow or shrink, in a manner which may be impossible to predict before- 
hand. In [BC], Bhatt and Cai propose a dynamic version of the embedding problem. 
They consider a process graph which is a binary tree that can grow during execution. 
At each step any node of the tree that does not have two children can request to 
spawn a child. The dynamic embedding problem is harder than the static one since 
newly spawned children must be allocated to processors incrementally, without mak- 
ing assumptions about how the tree will grow in the future. Further, the placement 
decision must itself be implemented within the network in a distributed manner with- 
out accessing global information. The paradigm proposed by Bhatt and Cai disallows 
process migration; i.e. once a process is placed on a particular processor, it cannot be 
moved subsequently. Obviously, allowing migration can potentially give better load 
balancing/dilation but can also be extremely expensive in practice. 

Bhatt and Cai present ([BC]) a randomized algorithm for dynamically growing 
trees with M vertices on an N processor binary hypercube. Each child process is 
placed no farther than a distance $O(\log\log N)$ from its parent. Further, with high
probability (independent of the tree shape) the algorithm only assigns $O(M/N + 1)$
vertices to each processor. The congestion of the embedding was not determined but 
is probably on the order of log N. 

4.1.1 Summary of Results 

We consider the problem of growing trees on butterfly and hypercube networks. Our 
framework is identical to that of Bhatt and Cai ([BC]), although our growth algo- 
rithms are substantially simpler and have provably better performance. We begin by 
describing a level-by-level strategy for embedding a binary tree in a butterfly. Modifications
to this scheme form the basis of all our embedding algorithms. The first
modification we introduce is the use of random flip bits, which randomize the loca- 
tions of tree nodes within a level of the butterfly. Analysis of the behavior of these 
flip bits is sufficient to prove our first result. 

Theorem 4.1. An arbitrary binary tree $T$ with $M$ vertices can be dynamically
grown on an $N$ processor hypercube with dilation 1 such that with high probability
the maximum load per processor is $O(M/N + \log N)$.

Note that this is optimal to within a constant factor whenever the tree $T$ is large
(i.e., $M > N \log N$). For these large trees, it gives an optimal $O(M/N)$ load as
in [BC] while improving dilation from $O(\log\log N)$ to 1. Next we present another
modification of the scheme involving level balancing — in effect, we stretch certain 
paths within the tree so that the number of tree nodes assigned to any level of the 
butterfly is balanced. This modification leads to our next result, this time for a 
butterfly. 

Theorem 4.10. An arbitrary binary tree $T$ with $M$ vertices can be dynamically
grown on an $N$ processor butterfly network with dilation 2 such that with high probability
the maximum load per processor is $O(M/N + \log N)$.

Again, this is optimal to within a constant factor when $M > N \log N$. This result
is a substantial improvement over previous work since not even good static embed- 
dings of arbitrary binary trees were known. Finally, we take advantage of an embed- 
ding of the butterfly into the hypercube which embeds entire levels of the butterfly 
to subcubes of the hypercube in order to develop a scheme for local redistribution of 
load within levels. This leads to an embedding algorithm for the hypercube which 
simultaneously optimizes maximum load and dilation. In addition, the congestion of 
the embedding is optimal if M = O(N). 

Theorem 4.14. An arbitrary tree $T$ with $M$ vertices can be grown on an $N$ processor
hypercube with constant dilation such that with high probability the maximum load
is $O(M/N + 1)$ and the congestion is $O(M/N + 1)$.

It should be noted that although our theorems are phrased in terms of trees 
which only grow, these embedding algorithms are also effective for dynamic trees
which can both grow and shrink at their leaves. Consider a binary tree $T$ which
grows and shrinks. At each stage in the tree's evolution, the probability space of
possible embeddings of the current form $T'$ of the tree is equivalent to the space of
embeddings which would have occurred had we simply grown the tree $T'$ using the
same algorithm. Therefore the same results hold for each step in the tree's evolution 
(assuming, of course, that the total number of steps in the tree's evolution is bounded 
by a polynomial in N). 

We also prove a lower bound for deterministic embedding algorithms for hypercubes
which shows that any deterministic algorithm which balances load must necessarily
have dilation $\Omega(\sqrt{\log N})$. It follows that any embedding algorithm which
simultaneously optimizes load and dilation (to within constant factors) must be
randomized. This consequence also holds for the butterfly, since it is a subgraph of the
hypercube.

Tom Leighton, Abhiram Ranade and Eric Schwabe coauthored all the work ap- 
pearing in chapter four. 

4.1.2 Overview 

The basic embedding algorithm is presented in section 4.2 along with the introduction 
of flip bits and the proof of theorem 4.1. The level-balancing scheme is introduced 
and analyzed in section 4.3, along with a proof of theorem 4.10. Improvements to the 
hypercube embedding algorithm and proof of theorem 4.14 are given in section 4.4. 
Section 4.5 states and proves the lower bound for deterministic algorithms. 

4.2 The Basic Growth Algorithm 
4.2.1 Preliminary Scheme 

We begin with a level-by-level strategy for growing a tree on an N-node butterfly 
network. For this chapter, we set $n$ so that $N = n 2^n$. That is, the $N$-node butterfly
has $n$ levels.

In the cases where we are ultimately interested in an embedding in a hypercube, 
we will first embed the tree in a butterfly, and then consider some embedding of 
the butterfly in the hypercube. We place the root of the tree on processor 0 in
the butterfly. This processor is connected to two processors in level 1, on which we 
place the children of the root. These processors are in turn connected to 4 level 2 
processors, which will in turn receive the children of the root's children, and so on. 
This strategy enables us to grow any n level binary tree with dilation 1, and with at 
most one tree vertex per butterfly processor. Trees with greater height are wrapped 
around; i.e. level n vertices are placed in butterfly level 0, and so on. The set of tree 
vertices which are mapped to level i of the n level butterfly consists of those vertices 
in levels $i, i + n, i + 2n, \ldots$; we refer to this as the $i$th level set of the tree.
There are two issues we need to consider: 

1. Evenly distributing tree vertices within the processors in each level. We would
like the vertices belonging to level set $i$ to be evenly distributed among the
processors in the $i$th level of the butterfly; i.e. to guarantee that no single
processor in level $i$ receives too many vertices.

2. Evenly distributing tree vertices among different butterfly levels. For example,
when mapping a complete binary tree of height $h$, level $(h - 1) \bmod n$ of the
butterfly would receive the leaves of the tree, or about half the total number of
vertices. Ideally, we would like the vertices to be divided evenly among all the
levels of the butterfly.

We will defer our consideration of the second issue until section 4.3. First, a 
modification of the basic scheme helps us achieve balance within a level. 
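
The preliminary level-by-level scheme can be sketched as follows. This is a minimal Python illustration under assumptions that the text does not fix: butterfly processors are written as (level, row) pairs, a processor at level $i$ is joined to the next level by a straight edge (same row) and a cross edge (row with bit $i$ flipped), the root sits at (0, 0), and the left child always takes the straight edge. The name `place_tree` is hypothetical.

```python
def place_tree(children, root, n):
    """Place a binary tree on an n-level wrapped butterfly, level by level.

    children: dict mapping a vertex to a (left, right) tuple, with None
              for a missing child; leaves may be absent from the dict.
    Returns a dict mapping each tree vertex to a (level, row) processor.
    """
    placement = {root: (0, 0)}
    stack = [root]
    while stack:
        u = stack.pop()
        level, row = placement[u]
        next_level = (level + 1) % n            # tall trees wrap around
        targets = (row, row ^ (1 << level))     # straight edge, cross edge
        for child, target_row in zip(children.get(u, (None, None)), targets):
            if child is not None:
                placement[child] = (next_level, target_row)
                stack.append(child)
    return placement
```

With $n = 3$, the seven vertices of a complete binary tree of three levels land on seven distinct processors, while a path of four vertices wraps so that its last vertex shares the root's processor.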

4.2.2 Flip Bits 

A random flip bit is generated at each vertex of the tree to decide where its children 
will be spawned. Consider a vertex v of the tree that has been placed on some 
processor $p$ in level $i$ of the butterfly. This node is connected to processors $q$ and $r$ in
level $(i + 1) \bmod n$, which will receive the children of $v$. The flip bit chosen for vertex $v$
decides whether the left child of v will be placed on q or on r. The right child is then 
placed on the other processor. Note of course that it is not necessary that v have two 
children - the bit only determines where the children will be placed if they are ever 
spawned. 
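
A hedged Python sketch of growth with flip bits follows. As an assumption for illustration, processors are (level, row) pairs and the two processors reachable from level $i$ are the same row and the row with bit $i$ flipped; the helper names `grow_with_flip_bits` and `column_loads` are not from the text.

```python
import random
from collections import Counter


def grow_with_flip_bits(children, root, n, rng):
    """Place a binary tree on an n-level wrapped butterfly using flip bits.

    children: dict mapping a vertex to a (left, right) tuple, with None
              for a missing child.
    rng:      a random.Random instance supplying one flip bit per vertex.
    """
    placement = {root: (0, 0)}
    stack = [root]
    while stack:
        u = stack.pop()
        level, row = placement[u]
        q, r = row, row ^ (1 << level)   # the two processors one level down
        # The flip bit decides whether the left child goes to q or to r;
        # the right child takes the other processor.
        if rng.getrandbits(1):
            q, r = r, q
        left, right = children.get(u, (None, None))
        for child, target_row in ((left, q), (right, r)):
            if child is not None:
                placement[child] = ((level + 1) % n, target_row)
                stack.append(child)
    return placement


def column_loads(placement):
    """Number of tree vertices per butterfly column (row index)."""
    return Counter(row for _, row in placement.values())
```

The column loads are the quantity bounded in lemma 4.2: when each hypercube node simulates one butterfly column, they become the per-processor loads of the hypercube embedding.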

In section 4.3 we will show that this ensures even distribution within each level. 
Intuitively, each vertex is effectively placed using a random path determined by the 
flip bits chosen along its ancestors. For now, this modified scheme is sufficient to 
prove theorem 4.1.

Theorem 4.1. An arbitrary binary tree $T$ with $M$ vertices can be grown dynamically
on an $N$ processor hypercube with dilation 1 such that with high probability the
maximum load per processor is $O(M/N + \log N)$.

Theorem 4.1 follows directly from the following lemma. 

Lemma 4.2. An arbitrary tree $T$ with $M$ vertices can be grown in a butterfly network
of $N$ processors such that each column in the butterfly receives no more than
$O(M/2^n + n)$ vertices with high probability.

Suppose this lemma were true. Then by simulating the $N = n 2^n$-node butterfly by
a $2^n$-node hypercube, where each node of the hypercube simulates an entire column
of the butterfly, we have an embedding algorithm for the hypercube which achieves
dilation 1 and load $O(M/N + \log N)$ with high probability. Thus this lemma is
sufficient to prove theorem 4.1. 

The general idea behind the proof of lemma 4.2 is that a large number of vertices 
will be placed in the same column in the butterfly only if the flip bits on the paths 
leading to these vertices are chosen in a specific (unlikely) manner. 

A stagnant path $p$ is a maximal path $v(1), v(2), \ldots, v(l)$ in $T$ with $v(1)$ towards
the root such that all $v(i)$ are placed in the same column $v$ of the butterfly. Let the
leader of $p$ be the $n$th ancestor of $v(1)$, and the trace of $p$ be the set of $n + l - 1$ vertices
between the leader (inclusive) and $v(l)$ (exclusive). If $v(1)$ is in the first $n$ levels of
the tree, then the leader of the path is defined to be the root of the tree.

Notice that there is a unique path in the butterfly from the leader of a stagnant 
path $p$ to vertex $v(1)$. Thus, given the column in which the leader lies, and the column
in which the path p lies, we can completely determine the flip bits chosen along the 
trace of the path. The next observation is that the traces of distinct stagnant paths 
mapped to the same column are distinct; i.e. the information gained from one trace 
is different from that obtained in the other. 

Lemma 4.3. Let p and p' be two distinct stagnant paths placed in the same column 
of the butterfly. Then their traces are vertex disjoint in the tree. 

Proof. Contrary to the lemma, suppose the lowest point in the tree at which the 
traces intersect is vertex u. At vertex u, the two traces are mapped to the same 
column of the butterfly. Likewise, the two stagnant paths are mapped to the same 
column. The two children of u are mapped to different columns of the butterfly, 
however, and therefore the traces must reconverge in some butterfly column between 
the children of u and the beginnings of the two stagnant paths. However, the two 
paths cannot meet again in any column until they have traversed all n levels of the 
butterfly. Since the two stagnant paths are at a distance less than n from u, the traces 
cannot reconverge in the butterfly before reaching them, and we have a contradiction. 
■ 

Lemma 4.4. For any column $v$ of the butterfly, there is at most one stagnant path
mapped to $v$ such that $v(1)$ is in the first $n$ levels of the tree.

Proof. This lemma follows immediately from lemma 4.3 by noting that any two such
paths will have the same leader (the root of the tree). ■

Proof (of Lemma 4.2). We shall count the number of different settings of the flip
bits that give rise to some column having at least $C = k(M/2^n + n)$ tree vertices.
This can be done as follows:

1. Choose the column: $2^n$ choices.

2. Choose the number of stagnant paths: $C$ choices.

3. Choose the endpoint of each path: $\binom{M}{C_0}$ choices, where $C_0$ is the number of stagnant
paths. Define $\beta = C/C_0$.

4. Choose the length of the paths: $\binom{C + C_0}{C_0}$ choices.

5. Choose the flip bits at all vertices in $T$ except those in the $C_0$ traces. The total
number of flip bits is $M$, and the length of the $j$th trace is $n + l_j - 1$, except
for the possible case when one stagnant path has $v(1)$ in the first $n$ levels of the
tree, in which case the length of its trace is $l_j - 1$. Thus the total number of
bits this step fixes is $M - \sum_j (n + l_j - 1) + n = M - (C_0(n - 1) + C) + n$. Thus
the total number of choices is $2^{M - (C_0(n-1) + C) + n}$.

First we claim that the above choices completely determine all the flip bits. To 
see this, consider the trace with its leader belonging to the smallest level in T, of all 
traces. Clearly, the last step of the above procedure fixes the position of the leader. 
This fixes all the bits in the trace, since the endpoint and the length of the trace are 
known. The bits for the other traces are similarly determined. 

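The total number of ways of choosing all the bits is $2^M$. Thus the probability that
some column gets more than $C$ vertices is at most

\[
\begin{aligned}
& 2^{2n}\, C \binom{M}{C_0} \binom{C + C_0}{C_0}\, 2^{M - (C_0(n-1) + C)} / 2^{M} \\
&\le 2^{2n}\, C \left( \frac{M (C + C_0) e^2}{C_0^2} \right)^{C_0} 2^{-(C_0(n-1) + C)} \\
&\le 2^{2n}\, C \left( \frac{2 e^2 M (C + C_0)}{2^n C_0^2\, 2^{C/C_0}} \right)^{C_0} \\
&\le 2^{2n}\, C \left( \frac{2 e^2 M \beta(\beta + 1)}{2^n C\, 2^{\beta}} \right)^{C_0}
\end{aligned}
\]

To go from the first line to the second we have used the inequality $\binom{n}{r} < (ne/r)^r$.
Choosing $k > 10e^2$, and noting that $\beta(\beta + 1) < 5(2^{\beta/2})$, we can simplify the above
expression to:

\[
\begin{aligned}
& 2^{2n}\, C \left( \frac{1}{2^{\beta/2}} \right)^{C_0} \\
&\le 2^{2n}\, C\, 2^{-C/2} \\
&\le 2^{-C/4} \\
&\le 2^{-kn/4}
\end{aligned}
\]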

Figure 4-1: Level balancing a tree, n=6. The numerical labels indicate the stretch
counts chosen at those nodes. White nodes indicate dummy vertices.

4.3 Embedding in the Butterfly

In this section we introduce a modification to the embedding algorithm which ensures
that with high probability the nodes of the binary tree are distributed evenly among 
the levels of the butterfly. We then prove that the flip bits described in the previous 
section are sufficient to distribute the tree nodes evenly within each level. 

4.3.1 A Level-Balancing Transformation 

We transform the tree T being grown by selectively inserting dummy vertices into 
some of its edges during the growth. Even if some level originally has a dispropor- 
tionately large number of vertices, the newly introduced vertices help to even the 
distribution of the tree vertices among the levels. 

The $n$-way level balancing transformation is as follows. Define a vertex of $T$ to
be distinguished if it lies in level $i \equiv 0 \pmod{n/3}$.¹ For each distinguished vertex $v$
in $T$ we pick a random number $S(v)$ between 0 and $n/3$ called the stretch count. We
insert a single dummy vertex in each of the edges that connect $v$ to its descendants
in levels $i + 1$ through $i + S(v)$. Figure 4-1 illustrates the transformation. Note
that this transformation can be applied as the tree grows. Each node only needs to 
know what level of the tree T it belongs to, and the stretch count generated at its 
nearest distinguished ancestor. This is sufficient information to decide whether or not 
a dummy vertex is inserted when a child is spawned. 
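
The local decision rule can be sketched in Python. This is an illustration under assumptions: $n$ is taken to be divisible by 3, the stretch count is drawn from 0 to $n/3$ inclusive, and the names `stretched_levels` and `pick_stretch` are hypothetical. The function returns the level of every original vertex in the transformed tree $B(T)$.

```python
import random


def stretched_levels(children, root, n, pick_stretch):
    """Return {vertex: level in B(T)} after inserting dummy vertices.

    children:     dict mapping a vertex to a tuple of its children.
    pick_stretch: callable giving the stretch count S(v) of a
                  distinguished vertex v (a value in 0..n/3).
    """
    period = n // 3
    new_level = {root: 0}
    # Stack entries: (vertex, level in T, stretch count inherited from the
    # nearest distinguished ancestor).
    stack = [(root, 0, None)]
    while stack:
        u, level, inherited = stack.pop()
        if level % period == 0:          # u is distinguished
            inherited = pick_stretch(u)
        for child in children.get(u, ()):
            if child is None:
                continue
            # Distance of the child below the distinguished ancestor.
            offset = level % period + 1
            dummy = 1 if offset <= inherited else 0
            new_level[child] = new_level[u] + 1 + dummy
            stack.append((child, level + 1, inherited))
    return new_level


rng = random.Random(0)
random_levels = stretched_levels({0: (1, 2)}, 0, 6, lambda _v: rng.randint(0, 2))
```

Each spawn needs only the child's level in $T$ and the stretch count passed down from the nearest distinguished ancestor, which matches the information requirement stated above.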

The new tree B(T) that results is grown on the butterfly using the procedure 
described in section 4.2.2. This gives a dilation 1 embedding for B(T). This corre- 
sponds to a dilation 2 embedding of T, since some of the edges in T were replaced by 
two edges in B(T). 

4.3.2 Analysis of Tree Balancing 

We show that the n-way level balancing transformation of section 4.3.1 is sufficient 
to evenly distribute the tree vertices among the levels in the butterfly. In particular, 
we show that for any tree T, no level set in B(T) will contain a disproportionately 
large number of vertices. Since level $i$ of the butterfly receives vertices from the $i$th
level-set of B(T), this implies that tree vertices are uniformly distributed among the 
butterfly levels. 

Lemma 4.5. For an arbitrary tree $T$, the $n$-way level-balancing transformation gives
a tree $B(T)$ such that the total number of vertices in the $i$th level-set of $B(T)$ is at
most $O(M/n + 2^n)$ with high probability.

We will prove the following slightly modified (but equivalent) version. Define the
i-th level set triple of a tree to be the set of vertices from level sets i, i + n/3 and
i + 2n/3. Define a partition of T into 3 zones as follows (Figure 4-2). Zone 0 consists
of vertices in levels kn through kn + n/3 - 1. Zone 1 consists of vertices in levels
kn + n/3 through kn + 2n/3 - 1. Zone 2 consists of vertices in levels kn + 2n/3
through (k + 1)n - 1. Each zone consists of a number of trees of maximum height
n/3. We will show that no level set triple of B(T) will receive more than O(M/n)
vertices from any zone of T, with high probability. Lemma 4.5 follows because there
are only 3 zones, and since the number of vertices in a level set triple upper bounds
the number of vertices in a level set.

Figure 4-2: Subdivision into zones, and a forest f_j.

¹ In what follows we may make references like "(mod x)" or "contribution of x messages" when
x may not be integral. Rounding these quantities to integers does not affect the correctness of the
proof. For ease of exposition, we shall not consider the issue.

The key observation is that each zone can be partitioned into a set of forests
f_1, f_2, ..., f_a that contribute independently to level set triple i, for any i. We illustrate
the partitioning for zone 1. Each f_j consists of all trees from zone 1 between levels
kn + n/3 and kn + 2n/3 - 1 that have a common ancestor r_j at level kn, for some
fixed k. Other zones are partitioned similarly.

Lemma 4.6. Let X_j denote the number of zone 1 vertices from a forest f_j placed
in level set triple i of B(T). Then all variables X_j are mutually independent, and
E(X_j) = 3M_j/n, where M_j is the number of vertices in f_j.

Proof. Let variable Y_j denote the level set triple into which the roots of the trees
in f_j are placed. By definition, these roots are all placed in the level set triple given
by the level set triple of r_j plus S(r_j), mod n/3. Since the stretch counts of the r_j's
are uniformly selected from [0, n/3] and are mutually independent, it follows that the
Y_j's are also uniformly selected from [0, n/3] and are mutually independent. Since
X_j is completely determined by Y_j and the stretch counts chosen at the roots of trees
in f_j, it follows that the X_j are mutually independent, and that E(X_j) = 3M_j/n. ■
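
The value of the expectation can be spelled out by linearity. The following is a sketch under the rounding convention of the footnote: it treats each vertex u of f_j as landing in any one of the n/3 level set triples with equal probability, which is what the uniformity of Y_j provides; the sum over u is notation introduced here for illustration.

```latex
\[
E(X_j) \;=\; \sum_{u \in f_j} \Pr[\,u \text{ is placed in level set triple } i\,]
       \;=\; \sum_{u \in f_j} \frac{1}{n/3}
       \;=\; \frac{3M_j}{n}.
\]
```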

Similarly, this lemma holds for any other zone of the tree T, except for the first
section of zone 0, which contains the vertices in levels 0, ..., n/3 - 1. However, this
segment of the tree contains at most 2^{n/3} - 1 nodes, which will be mapped one-to-one
to nodes of the butterfly.

Proof (of Lemma 4.5). The X_j are independent random variables. Clearly, no X_j
can contribute more than 2^{2n/3} vertices, since the forest is part of a tree of height no
more than 2n/3. The mean of each X_j is 3M_j/n, where M_j is the number of vertices
in f_j; therefore the mean of X (= Σ_j X_j) is at most (3/n) Σ_j M_j ≤ 3M/n. We have by the
independence of the X_j that for any t,

\[
E[e^{tX}] \;=\; \prod_j E[e^{tX_j}] \;=\; \prod_j \sum_k \Pr[X_j = k]\, e^{tk}.
\]

As in lemma 2.7, the expectation is maximized when only the events [X_j = 0] and
[X_j = 2^{2n/3}] have positive probability. Suppose there were some value x, not equal to
0 or 2^{2n/3}, such that Pr[X_j = x] = δ > 0. Then by the convexity of e^{tX_j}, changing
Pr[X_j = x] to 0 and adding δ/2 to each of Pr[X_j = x - 1] and Pr[X_j = x + 1] would
preserve the mean and increase the expectation of e^{tX_j}. It follows that in order to
maximize the expectation, the two endpoints of the interval must be the only events
with positive probability. If we use Markov's inequality to put an upper bound on
Pr[X_j = 2^{2n/3}] then

\[
E[e^{tX}] \;\le\; \prod_j \left(1 + \frac{3M_j}{n\,2^{2n/3}}\left(e^{t\,2^{2n/3}} - 1\right)\right)
          \;\le\; \exp\!\left(\frac{3M}{n\,2^{2n/3}}\left(e^{t\,2^{2n/3}} - 1\right)\right).
\]

Again using Markov's inequality, we obtain for any constant b,

\[
\Pr[X > 3bM/n] \;=\; \Pr\!\left[e^{tX} > e^{3bMt/n}\right]
  \;\le\; \exp\!\left(\frac{3M}{n\,2^{2n/3}}\left(e^{t\,2^{2n/3}} - 1\right) - \frac{3bMt}{n}\right).
\]

This quantity is minimized at t = ln b / 2^{2n/3}. At this value of t, and as long as
M > n^2 2^{2n/3}, this quantity is smaller than N^{-k} for some constant k which can be
made as large as desired by choosing b sufficiently large. ■
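
The final step can be checked numerically. The sketch below evaluates the logarithm of the tail bound at t = ln b / 2^{2n/3}, assuming the moment bound exp((3M/(n 2^{2n/3}))(e^{t 2^{2n/3}} - 1)) and the threshold 3bM/n used in the proof; the function name `log_tail_bound` and the sample parameters are illustrative assumptions.

```python
import math


def log_tail_bound(M, n, b):
    """Natural log of the Markov/Chernoff-style bound on Pr[X > 3bM/n].

    Evaluated at t = ln(b) / 2^(2n/3), the choice of t named in the proof.
    """
    cap = 2.0 ** (2.0 * n / 3.0)          # largest value a single X_j can take
    t = math.log(b) / cap
    mgf_exponent = (3.0 * M / (n * cap)) * (math.exp(t * cap) - 1.0)
    return mgf_exponent - 3.0 * b * M * t / n


# With M = n^2 * 2^(2n/3) the exponent simplifies to 3n(b - 1 - b ln b).
n, b = 30, 8.0
M = n * n * 2.0 ** (2.0 * n / 3.0)
print(log_tail_bound(M, n, b), 3 * n * (b - 1 - b * math.log(b)))
```

Since ln N = ln n + n ln 2 for N = n 2^n, an exponent of the form -Θ(n) with a constant that grows with b is enough to fall below -k ln N.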

4.3.3 Effectiveness of Flip Bits 

We now show that, given the effectiveness of the level-balancing algorithm, the flip 
bits suffice to distribute the tree nodes within the levels of the butterfly. 

Lemma 4.7. Let W_i denote the total number of vertices in level set i in an arbitrary
binary tree T. When T is grown on a butterfly with n levels, no processor from any
level i receives more than O(W_i/2^n + n) vertices with high probability, for all i.

In other words, whenever W_i > n 2^n, each of the 2^n processors in level i will receive
roughly the same number of tree vertices.

The key to the proof is the observation that the vertices placed on a processor
can be attributed to a large number of mutually independent sources. To see this,
partition T into subtrees T_1, T_2, ... where each subtree is rooted at some vertex in
level kn + i and consists of all the descendants of that vertex between levels
kn + i + 1 and kn + i + n (figure 4-3).

Lemma 4.8. At most one level n vertex from each subtree T_j will be placed on any
processor p on level i of the butterfly. The probability of a vertex from T_j being placed
on processor p is w_j/2^n, where w_j denotes the number of vertices in level n of tree T_j.
Further, the contributions of the different subtrees to p are mutually independent.

Proof. Any tree T_j can have at most 2^n vertices at level n, and the growth algorithm
guarantees that these will be placed on distinct processors within a single level. Thus
we know that at most one vertex from a tree T_j will be placed on a given processor
p in level i of the butterfly.

It follows from the above that the number of vertices from T_j placed on p is a
random variable with value either 0 or 1. The probability that any given vertex from
level n of T_j will be placed on p is 1/2^n, so the expectation of this random variable
is w_j/2^n. Since the value of the random variable can only be 0 or 1, w_j/2^n must be
the probability that it is 1. Thus the probability of a vertex from T_j being placed on
p is w_j/2^n.

The independence between different subtrees follows because the flip bits in each
subtree are picked independently. ■

Figure 4-3: The tree T and its partition, i = 1, n = 2.

To complete the proof of lemma 4.7, we need the following lemma, due to Hoeffding
([H]).

Lemma 4.9. [Hoeffding] If we have L independent Bernoulli trials with respective
probabilities p_1, ..., p_L, with Lp = Σ_i p_i, and m > Lp + 1 is an integer, the probability
of at least m successes is at most B(m, L, p), where B(m, L, p) ≤ (Lpe/m)^m.
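
As a quick sanity check of the closed-form bound, the following sketch compares (Lpe/m)^m with an exact binomial tail. It only covers the special case in which all trials share the same probability p (the lemma itself allows different p_i); the function names are illustrative assumptions.

```python
import math


def hoeffding_bound(m, mean):
    """The closed-form bound (Lp * e / m)^m, where mean = Lp is the expected number of successes."""
    return (mean * math.e / m) ** m


def binomial_tail(m, L, p):
    """Exact probability of at least m successes in L independent trials of probability p."""
    return sum(math.comb(L, k) * p ** k * (1 - p) ** (L - k) for k in range(m, L + 1))


L, p = 100, 0.05
for m in (7, 10, 15, 20):
    print(m, binomial_tail(m, L, p), hoeffding_bound(m, L * p))
```

The bound is loose for m close to the mean and decays geometrically once m exceeds e times the mean, which is the regime used in the proof of lemma 4.7.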

Proof (of Lemma 4.7). The number of vertices placed at a processor is the sum of
independent random variables corresponding to each tree T_j. The expected number
of vertices is Σ_j w_j/2^n = W_i/2^n. The probability that a given processor receives more
than k(n + W_i/2^n) vertices is at most (using lemma 4.9):

\[
\left(\frac{e\,W_i/2^n}{k\,(n + W_i/2^n)}\right)^{k(n + W_i/2^n)} \;\le\; \left(\frac{e}{k}\right)^{kn}.
\]

Thus the probability that one of the 2^n processors in any of the n levels receives more
than k(W_i/2^n + n) vertices is at most

\[
n\,2^n\,(k/e)^{-kn} \;<\; N^{-k \log k / k_1}
\]

for some constant k_1. ■

Theorem 4.10. An arbitrary binary tree T with M vertices can be grown dynamically
on an N-processor butterfly network with dilation 2 such that with high probability
the maximum load per processor is at most O(M/N + log N).

Proof. By lemma 4.5, with high probability we have W_i = O(M/n + 2^n) for all i,
and by lemma 4.7, with high probability no processor in level i will receive more
than O(n + W_i/2^n) vertices. Thus with high probability, at most O(log N + M/N)
vertices are mapped to any processor. ■
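
The substitution behind the last step can be written out explicitly, using N = n 2^n (so that n ≤ log N):

```latex
\[
n + \frac{W_i}{2^n}
  \;=\; n + O\!\left(\frac{M/n + 2^n}{2^n}\right)
  \;=\; n + O\!\left(\frac{M}{n\,2^n} + 1\right)
  \;=\; O\!\left(\frac{M}{N} + \log N\right).
\]
```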

4.4 An Improved Hypercube Embedding 

The butterfly can be embedded in the hypercube with dilation 2 such that each level 
of the butterfly is a subcube of the hypercube. Therefore we can have the hypercube 
simulate any embedding algorithm for the butterfly, with a unique 2^n-node subcube
simulating each level. We will take advantage of this by using a scheme which has each
level (subcube) receiving only O(M/n + 2^n) tree nodes, and developing a method for
local distribution within these subcubes which will reduce the load on each individual
processor while guaranteeing low congestion. We begin with some preliminaries.

4.4.1 Embedding the Butterfly and Star Covers 

Let G(x) be the Gray code value of the binary string x, defined by G(x_{log n} ... x_1) =
x_{log n} | x_{log n} ⊕ x_{log n - 1} | ... | x_2 ⊕ x_1, where | denotes concatenation. For any
bit string x, G(x) and G((x + 1) mod n) differ
in exactly one bit position. For an integer i, let bin(i) be the binary representation of
i. The embedding which maps butterfly processor v_l to node G(bin(l)) | bin(v) of the
hypercube has dilation 2 and maps each level of the butterfly to a distinct 2^n-node
subcube of the hypercube. Also note that within each level l, if v and v^k differ in
exactly one bit, then there is a hypercube edge between the embedded locations of
the nodes v_l and v^k_l.
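
The embedding can be sketched and checked exhaustively for a small case. The sketch below makes several assumptions that are not stated in this section: the butterfly is wrapped (level n - 1 connects back to level 0), n is a power of two so that the Gray code is cyclic, and the cross edge leaving level l flips bit l of the row address. The function names are illustrative.

```python
def gray(x):
    """Gray code of x: bit k of the result is x_{k+1} XOR x_k, with the top bit unchanged."""
    return x ^ (x >> 1)


def embed(level, v, n):
    """Hypercube address G(bin(level)) | bin(v) of butterfly processor v in the given level."""
    return (gray(level) << n) | v


def hamming(a, b):
    return bin(a ^ b).count("1")


def max_dilation(n):
    """Largest hypercube distance between the images of two adjacent butterfly processors."""
    worst = 0
    for level in range(n):
        nxt = (level + 1) % n                      # assumed wrap-around
        for v in range(2 ** n):
            for w in (v, v ^ (1 << level)):        # straight edge, cross edge
                worst = max(worst, hamming(embed(level, v, n), embed(nxt, w, n)))
    return worst


print(max_dilation(4))
```

Straight edges change only the level field, by one Gray code step; cross edges additionally flip one bit of the row address, which accounts for the dilation of 2.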

For any node x of a 2^n-node hypercube, we define the full star centered at x to be
the set of nodes consisting of x along with the n nodes adjacent to x. The existence
of perfect one-error-correcting codes implies that when n = 2^m - 1 for some integer m
there exists a collection of 2^n/(n + 1) full stars such that every node of the hypercube
belongs to precisely one star in the collection.

Suppose n is not of this form. Consider the largest n' such that n' < n and n' is of
the form n' = 2^m - 1; then n' ≥ n/2. We can partition the hypercube into subcubes
of 2^{n'} nodes, and cover each of these with full stars. This star cover perfectly covers
the nodes of the 2^n-node hypercube. Each star in the star cover consists of a node x
and some subset of O(n) (in this case at least n/2) of its neighbors.
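
For n = 2^m - 1 one concrete choice of perfect one-error-correcting code is the binary Hamming code. The sketch below uses it to assign every hypercube node to a star center; this particular code and the names `syndrome` and `star_center` are assumptions made for illustration, since the text only relies on the existence of such a code.

```python
from collections import Counter


def syndrome(x, n):
    """XOR of the 1-indexed positions of the set bits of x (Hamming code syndrome)."""
    s = 0
    for pos in range(1, n + 1):
        if (x >> (pos - 1)) & 1:
            s ^= pos
    return s


def star_center(x, n):
    """Center of the full star containing x, for n of the form 2^m - 1."""
    s = syndrome(x, n)
    # Syndrome 0 means x is a codeword (a center); otherwise flipping bit s reaches one.
    return x if s == 0 else x ^ (1 << (s - 1))


n = 7
sizes = Counter(star_center(x, n) for x in range(2 ** n))
print(len(sizes), set(sizes.values()))
```

Every node is either a center or adjacent to exactly one center, so the stars partition the hypercube into 2^n/(n + 1) sets of n + 1 nodes each.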

Choose a star cover for a 2^n-node hypercube, and duplicate this cover in each
subcube of the N (= n 2^n)-node hypercube which corresponds to a level of the butterfly.
This collection of stars yields a star cover of the N-node hypercube; call it C.

4.4.2 Modifying the Embedding Algorithm 

Our discussion of the hypercube algorithm has two parts: 

1. We describe a modified algorithm for embedding on the butterfly which, when
simulated on a hypercube, maps at most O(M/2^n + n) tree nodes to any star
in the cover C, with high probability.

2. We show how to deterministically redistribute the load within a star of the
hypercube among its nodes in such a way that each node receives O(M/N +
1) tree nodes, the dilation remains constant and the resulting congestion is
O(M/N + 1).

We begin by showing how to modify the butterfly embedding algorithm given in 
the previous section so that when it is simulated on the hypercube, the amount of 
load assigned to any star in the cover C is balanced. 

We will modify our embedding algorithm as follows. Use the embedding algorithm
from the previous section, with one change. Previously we placed the children of a tree
node v ∈ B(T) which was embedded in level l into level l + 1, choosing their locations by
a random flip bit. We will now place the first child of v into level l + 2, using a pair
of flip bits to determine its position within the level, and place the second child (if
it exists) at the location in that level determined by complementing both flip bits. It
is clear that this will increase dilation by a factor of two.
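
The modified placement rule can be sketched as follows. The bit convention is an assumption (the two butterfly edges leaving level l are taken to control bits l and l + 1 of the row address, with wrap-around), as are the function name and return format; the growth procedure of section 4.2.2 is not reproduced here.

```python
import random


def place_children(v, level, n, rng=random):
    """Return (child_level, first_child_row, second_child_row) for a node at row v.

    The first child uses a random pair of flip bits; the second child uses the
    complemented pair, so the two children differ in exactly two bit positions.
    """
    f1, f2 = rng.getrandbits(1), rng.getrandbits(1)
    b1, b2 = level % n, (level + 1) % n            # assumed bit positions
    first = v ^ (f1 << b1) ^ (f2 << b2)
    second = v ^ ((1 - f1) << b1) ^ ((1 - f2) << b2)
    return (level + 2) % n, first, second


print(place_children(0b101010, 0, 6, random.Random(0)))
```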

Since we are embedding the level-balanced tree B(T), we know that, with high
probability, each level-set of the tree contains O(M/n + 2^n) nodes. As in lemma 4.6,
we observe that the vertices placed in a single star come from many independent
sources.

Partition B(T) into subtrees T_1, T_2, ... in such a way that the root of each subtree
is embedded at level l + 2 in the butterfly (or level l + 1 if n is odd) and each subtree
contains the descendants of its root down to the nodes embedded at level l in the
butterfly.

Lemma 4.11. Consider an arbitrary star S in C, contained in level l of the butterfly.
Then at most two vertices from each subtree can be placed on processors in S.
Furthermore, the contributions of each subtree to S are mutually independent.

Proof. Any subtree can have at most 2^{n/2} vertices placed in level l of the butterfly,
and these will necessarily be placed at distinct locations within the level. Suppose
that three vertices from the same subtree were mapped to the star S. Since the flip
bits are chosen in pairs, any pair of these vertices must be mapped to locations which
differ in an even number of bits; since they are all mapped to the same star, any pair
of them must differ in exactly two bit positions. Consider the paths to each of these
three vertices from their lowest common ancestor; call this vertex x. Clearly, two of
the vertices must be descendants of one child of x, and one must be a descendant of
the other. The vertex (call it y) which is the lone descendant of one of the children
of x now differs from both of the other two vertices in two bit positions which are
not corrected elsewhere in the tree. However, at some point the paths of the other
two vertices diverge (since they are placed on different processors in level l), and y's
path cannot duplicate the flip bits on both paths simultaneously. Therefore y differs
from one of the other two vertices in at least four bit positions, contradicting the
supposition that all three vertices were in the same star in level l. Therefore at most
two vertices from the same subtree can be placed in the star S.

The independence between different subtrees follows from the fact that the flip
bits are picked independently in each subtree. ■

Lemma 4.12. We can embed an arbitrary binary tree T with M nodes into an N-node
hypercube such that, with high probability, no star in the cover C receives more
than O(M/2^n + n) tree nodes.

Proof. Consider an arbitrary star S in level l from the cover C. Let X_i be the
number of tree nodes from subtree T_i which are assigned to processors in S. The X_i
are independent random variables, each with maximum value 2 (from lemma 4.11) and
mean O(m_i n/2^n), where m_i is the number of leaves of the subtree T_i. It follows that
the mean of X = Σ_i X_i is O(mn/2^n), where m is the number of tree nodes embedded
into level l of the butterfly. But since we are balancing levels by embedding the tree
B(T), we have m = O(M/n + 2^n), so that the mean of X is less than c_18 (M/2^n + n)
for some constant c_18 > 0. By the same argument as in the proof of lemma 4.5, we
can bound the expectation of the random variable e^{tX} by

\[
E[e^{tX}] \;\le\; \exp\!\left(\frac{c_{18}}{2}\left(\frac{M}{2^n} + n\right)\left(e^{2t} - 1\right)\right).
\]

Again as in lemma 4.5, we obtain for any constant b,

\[
\Pr\!\left[X > \frac{b\,c_{18}}{2}\left(\frac{M}{2^n} + n\right)\right]
  \;=\; \Pr\!\left[e^{tX} > e^{(t b c_{18}/2)(M/2^n + n)}\right]
  \;\le\; \frac{\exp\!\left((c_{18}/2)(M/2^n + n)(e^{2t} - 1)\right)}{e^{(t b c_{18}/2)(M/2^n + n)}}.
\]

At t = (ln b)/2 this quantity is smaller than
N^{-k} for k which can be made as large as desired by choosing b sufficiently large. ■

4.4.3 Redistributing Load Within Stars

With high probability, each star in the cover has at most O(M/2^n + n) tree nodes
assigned to its O(n) nodes. We would like to redistribute the O(M/2^n + n) load
on each star evenly among the O(n) nodes of the star, using the hypercube edges
connecting butterfly nodes within the same level, so that two conditions hold:

1. Each node gets at most O(M/N + 1) load.

2. We can choose paths of constant length between the redistributed locations of
adjacent tree nodes so that the congestion on any hypercube edge is at most
O(M/N + 1).

If these two conditions can be achieved by a redistribution scheme which runs
dynamically as the tree is embedded then, with high probability, the embedding
algorithm achieves load O(M/N + 1), dilation O(1), and congestion O(M/N + 1),
simultaneously optimizing load and dilation to within constant factors. In addition,
the congestion will be optimal if M = O(N).

Place an O(M/N + 1) upper limit (with appropriate choice of constant depending
on the constant in lemma 4.12 and the number of elements in each star) on the number
of tree nodes which can be assigned to a single node. All additional load is sent to
some other node in the star which has room. It is clear that we have sufficient capacity
over each star to handle the load, and that we will still have constant dilation. In
addition we will have maximum load O(M/N + 1) at each node of the hypercube. Note
that this does not allow process migration: each tree node is redistributed before it is
embedded into the hypercube. Once the node's redistributed location is determined,
it is embedded there permanently.
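
A minimal sketch of this capped placement rule is given below. The text only requires that overflow go to "some other node in the star which has room"; choosing the first such node in list order, and the names `redistribute`, `arrivals` and `cap`, are assumptions made for illustration.

```python
def redistribute(star_nodes, arrivals, cap):
    """Assign each arriving tree node to a hypercube node of the star.

    star_nodes: list of the hypercube nodes in one star.
    arrivals:   for each new tree node, the star node it was initially mapped to.
    cap:        the per-node upper limit on load.
    Returns (placement, load); each placement is final (no migration).
    """
    load = {node: 0 for node in star_nodes}
    placement = []
    for home in arrivals:
        if load[home] < cap:
            target = home
        else:
            # Overflow: send to some other node of the star that has room.
            target = next((u for u in star_nodes if load[u] < cap), None)
            if target is None:
                raise ValueError("star is over capacity")
        load[target] += 1
        placement.append(target)
    return placement, load


print(redistribute([0, 1, 2, 4], [1] * 6, cap=2))
```

The error branch corresponds to the low-probability event that a star receives more than cap times its size in tree nodes, which lemma 4.12 rules out with high probability for a suitable constant.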

Suppose we redistribute one tree node from node v^i to node v^j in the star centered
at v in the hypercube (load coming from or going to the center is redistributed
directly). This load is passed along the path v^i → v^{ij} → v^j rather than through the
center of the star.

Lemma 4.13. If all load being redistributed among points of stars follows paths
of the form v^i → v^{ij} → v^j rather than paths through the centers of stars, then the
resulting congestion due to this redistribution is O(M/N + 1).

Proof. For each star in the cover, consider the corresponding extended star, which
consists of the star centered at v, plus all vertices v^{ij} such that both v^i and v^j are in
the star. The edges in the extended star consist precisely of those paths along which
load can be redistributed in the star centered at v. The redistribution within that star
can add at most congestion O(M/N + 1) to any of the edges in the extended star. All
that remains is to observe that any edge in the hypercube is in at most two extended
stars. Thus the total congestion it receives from redistribution is O(M/N + 1). ■

Consider a tree node u and its child v. Let l be the level of the butterfly to which u is
mapped; then v is mapped to level
l + 2. Furthermore, their positions within their respective butterfly levels differ in at
most two bit positions (before redistribution). We consider here the case where
u and v are both initially mapped and redistributed to some point of a star rather
than the center. When one or both of them is mapped to the center of a star, the
argument is even simpler.

Let x and y be the centers of the stars to which u and v, respectively, are mapped.
Let p and q be the dimensions within the star to which u is mapped and redistributed,
and likewise r and s for v. Let f_1, f_2 be the flip bits selected when v is embedded as a
child of u. We then define the path from u, which is redistributed to x^q in level l, to
v, which is redistributed to y^s in level l + 2, as follows (this procedure is illustrated
in figure 4-4):

1. Move from level l to level l + 1 to level l + 2 along the edges determined by the
flip bits f_1, f_2.

2. Flip the bits in positions p, then q, in effect undoing the redistribution of u
which was performed in level l. We are now at y^r, the original location of v
before redistribution.

3. Flip the bits in positions s, then r. This takes us to y^s, the redistributed location
of v in its star in level l + 2.

Figure 4-4: The path chosen between redistributed node locations. The dashed lines
indicate the path determined by the flip bits, before redistribution. The first pair
of directed edges also show this choice of flip bits. The second pair undoes the
redistribution at level l. The last pair balances the load at level l + 2.
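
The three steps above can be traced in code. This is a sketch at the level of butterfly rows plus within-level hypercube edges; it assumes that the butterfly edges leaving levels l and l + 1 control bits l and l + 1 of the row address (with wrap-around), and the function names and the sample values are illustrative.

```python
def flip(addr, dim):
    """Neighbor of addr across hypercube dimension dim."""
    return addr ^ (1 << dim)


def redistributed_path(x, p, q, r, s, f1, f2, level, n):
    """Return the (level, row) pairs visited from x^q in level l to y^s in level l + 2."""
    b1, b2 = level % n, (level + 1) % n     # assumed bit positions of the flip bits
    row = flip(x, q)                        # redistributed location of u
    path = [(level, row)]
    row ^= f1 << b1                         # step 1: first butterfly edge
    path.append(((level + 1) % n, row))
    row ^= f2 << b2                         # step 1: second butterfly edge
    path.append(((level + 2) % n, row))
    for dim in (p, q, s, r):                # step 2 (p then q), step 3 (s then r)
        row = flip(row, dim)
        path.append(((level + 2) % n, row))
    return path


# Example: u at x^p with x = 0, p = 0, redistributed to x^q with q = 1.
print(redistributed_path(0, p=0, q=1, r=2, s=4, f1=1, f2=0, level=3, n=7))
```

The path uses two butterfly edges followed by four within-level hypercube edges, so its length is constant regardless of M and N.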

In order to show that the congestion is O(M/N + 1) in this case, it suffices to show
two things. First, that the congestion along each edge of the butterfly is O(M/N + 1).
Second, that the congestion along each hypercube edge connecting nodes within a
butterfly level is O(M/N + 1). From these two facts it follows that the total congestion
is O(M/N + 1).

Consider an arbitrary butterfly edge. There are at most two nodes of the butterfly 
which, when choosing the paths to their descendants, can use that edge. Since after 
redistribution each of these nodes has load O(M/N + 1), the congestion along the
edge being considered can also be at most O(M/N + 1).

The congestion on hypercube edges connecting butterfly nodes within a level has 
two sources: (1) the redistribution of the nodes embedded to that level, and (2) 
undoing the redistribution of the parents of the nodes embedded to that level. 

It follows directly from lemma 4.13 that the total congestion from the first source
does not exceed O(M/N + 1). We can break up the congestion derived from the
second source into four sets, according to the flip bits chosen along the paths from
the parents of the nodes in the level we are considering. Each fixed setting of flip
bits determines a bijective map of the nodes, and therefore the jump edges, from
two levels above to the current level. The congestion on any edge from undoing
the redistribution of parents equals the congestion on its preimage from the original
redistribution. The congestion in each set is therefore O(M/N + 1), and so the total
congestion derived from undoing the redistributions is also O(M/N + 1). It follows
that the entire congestion on any edge is O(M/N + 1).

Theorem 4.14. An arbitrary tree T with M vertices can be grown on an N-processor
hypercube with constant dilation such that with high probability the maximum load
is O(M/N + 1) and the congestion is O(M/N + 1).

4.5 A Lower Bound for Deterministic Algorithms 

In this section, we prove that any deterministic algorithm for dynamically embedding
an M-node tree in an N-node hypercube (M > N) which maintains maximum load
aM/N must have not only maximum but average dilation Ω(√(log N)/a^2). It follows
that any deterministic embedding algorithm which achieves O(M/N + 1) load must
necessarily result in embeddings with dilation Ω(√(log N)) for some trees. Thus any
embedding algorithm which simultaneously optimizes maximum load and dilation (to
within constant factors) must be randomized.

Theorem 4.15. Any deterministic algorithm for dynamically embedding trees in an
N-node hypercube which achieves load aM/N for a tree with M (> N) nodes must
have average edge length Ω(√(log N)/a^2).

Proof. Let aM/N be the load maintained by the embedding algorithm when embedding
an M-node tree. Define the size of a node in the hypercube to be the number
of 1's in the n-bit string associated with the node. Partition the hypercube into 6a
blocks, each block corresponding to some range of node sizes and containing N/6a
nodes. Since there are at most O(N/√(log N)) nodes of any size, each block must
contain at least Ω(√(log N)/a) sizes. This means that any two nodes which are in
non-adjacent blocks are at distance Ω(√(log N)/a) from each other.
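
The counting step can be checked numerically for small cases. The sketch below takes n = log N as the number of hypercube dimensions and computes the fewest consecutive node sizes whose classes together hold at least N/6a nodes; the comparison value √(πn/2)/(6a) comes from the standard estimate that the largest binomial coefficient is at most 2^n √(2/(πn)), which is an assumption brought in here rather than a bound stated in the text. The function name is illustrative.

```python
import math


def min_sizes_per_block(n, a):
    """Fewest consecutive sizes (numbers of 1 bits) covering at least 2^n / (6a) nodes."""
    counts = [math.comb(n, k) for k in range(n + 1)]   # nodes of each size
    target = -(-(2 ** n) // (6 * a))                   # ceiling of N / (6a)
    best, lo, total = n + 1, 0, 0
    for hi, c in enumerate(counts):
        total += c
        # Shrink the window from the left while it still holds enough nodes.
        while total - counts[lo] >= target:
            total -= counts[lo]
            lo += 1
        if total >= target:
            best = min(best, hi - lo + 1)
    return best


for n, a in ((64, 1), (400, 2)):
    print(n, a, min_sizes_per_block(n, a), math.sqrt(math.pi * n / 2) / (6 * a))
```

The narrowest block sits around size n/2, where the size classes are largest; every other block spans at least as many sizes.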

Choose an arbitrary M > N, and grow a path of M/2 nodes, starting at the root.
At this point, some block must contain M/12a tree nodes; choose such a block. We
will continue growing the tree from the M/12a nodes in the chosen block. Grow paths
from each of these tree nodes simultaneously, stopping each path's growth when it
reaches a hypercube node which is neither in the chosen block nor in a block adjacent
to it. The total number of nodes in the chosen block and adjacent blocks is at most
N/2a; since the algorithm maintains load aM/N, this set of nodes contains at most
(aM/N)(N/2a) = M/2 tree nodes. It follows that the total length of the M/12a
paths grown is at most M/2. This verifies that the tree being considered has at most
M nodes.

Now we can calculate the average edge length. Since each of the M/12a paths
connects a node in the chosen block to a node in some non-adjacent block, the total
edge length in these paths is at least (M/12a) × Ω(√(log N)/a) = Ω(M √(log N)/a^2).
Since the entire tree contains at most M edges, it follows that the average edge length
of the embedding is Ω(√(log N)/a^2). ■

4.6 Remarks 

The embedding in section 4.4 achieves dilation at most 12. One edge of T corresponds 
to at most two edges of B(T), each of which corresponds to two butterfly edges. In 
the embedding of the butterfly into the hypercube each butterfly edge corresponds to 
two edges of the hypercube. The redistribution algorithm adds at most four edges to 
the resulting path for a total of 12 hypercube edges. By combining the techniques of 
section 4.4 with those of section 4.2, we can reduce this to 6 or 7 with no increase in 
load or congestion. 

It is also likely that we can improve the bound on congestion to O(M/N log N + 1)
for hypercube embeddings by combining the techniques in section 4.4 with those of
section 4.2. We suspect that this bound is tight for all on-line algorithms, but we
can prove a bound of Ω(M/N log N + 1) only for deterministic on-line algorithms.
Any M-node binary tree can be embedded off-line in an N-node hypercube with load
O(M/N + 1) and constant dilation and congestion.

Although we have not worked out the details, we suspect that our embedding 
algorithms also work for trees that can shrink from the top as well as grow and shrink 
from the bottom, and that they can be made to work for arbitrary trees of small 
degree. We also expect that our techniques will prove useful for finding embeddings 
in other networks, such as the shuffle-exchange graph. 